This invention relates to fixed to variable length word encoding and to variable to fixed length word decoding. More particularly, the invention relates to mechanisms for resolving ambiguities when transitions between substrings of different alphabets occur. This enables the assignment of an optimum variable length word in encoding. In decoding, it further enables the parsing of comma free bit streams of variable length code words and the resolution of ambiguities between said variable length words and their fixed length representations.
Fixed to variable length encoding consists of assigning a variable length code word c(a) to the appearance of a fixed length codeword b(a). Each fixed length code word represents a character "a" in a source alphabet A.
Where there are two source alphabets A.sub.1 and A.sub.2, the same fixed length code word b(a) can represent a character in either alphabet, so ambiguity arises in both the encoding and decoding processes.
The encoder must determine when a transition from one alphabet to another has occurred to make an optimum variable length code assignment, c(a). The decoder must determine which alphabet was selected by the encoder to parse the comma free bit stream and to properly reassign the fixed length code word b(a) to the received variable length code word c(a).
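The ambiguity described above can be illustrated by a minimal sketch. The 5-bit code table, the choice of upper and lower case letters for A.sub.1 and A.sub.2, and the function names `b` and `candidates` are assumptions made for illustration only; the source gives no concrete code table.

```python
import string

# Hypothetical 5-bit fixed length code shared by two alphabets (an assumption
# for illustration; the source does not give a concrete code table).
A1 = string.ascii_uppercase  # alphabet A1: upper case characters
A2 = string.ascii_lowercase  # alphabet A2: lower case characters


def b(ch):
    """Return the 5-bit fixed length code word for ch (raises ValueError if ch is in neither alphabet)."""
    alphabet = A1 if ch in A1 else A2
    return format(alphabet.index(ch), "05b")


def candidates(word):
    """Return every character, one per alphabet, that the code word may denote."""
    i = int(word, 2)
    return [alphabet[i] for alphabet in (A1, A2) if i < len(alphabet)]


print(b("Q"), b("q"))         # both characters share one code word
print(candidates(b("q")))     # the code word alone does not identify the alphabet
```

Because `candidates` returns one character per alphabet, neither an encoder nor a decoder can settle on a character from b(a) alone; some indication of the alphabet currently in use is needed.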
There exist cost and performance advantages in the machine storage, transfer and manipulation of character alphabets using equal fixed length code words. Among the advantages are uniformity and standardization of storage cell/register sizes, a fixed number of conductors for data busing, and the reduced or nonexistent informational overhead needed to track character boundaries. However, variable length representation is attractive for transmission and storage, where the average compressed code word length may be less than the fixed code word length.
English text may be described in machine form by fixed length code words from several different alphabets. For example, there can exist an alphabet of upper case characters A, B, . . . , Z; an alphabet of lower case characters a, b, . . . , z; and an alphabet of numbers and symbols 1, 2, 3, . . . , %, +, etc. It is possible to construct a fixed length code of length L whose capacity 2.sup.L .gtoreq. T, where T is the number of upper case characters plus the number of lower case characters, etc. However, where 2.sup.L < T, some fixed length code words b(a) will be ambiguous. Indeed, the cost tradeoffs may be such that increasing the length L of b(a) is far more expensive than using mechanisms for resolving the ambiguities.
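The capacity condition can be checked with a short calculation. In the sketch below, the count of 20 numbers and symbols and the function names are assumed purely for illustration; only the relation between 2.sup.L and T comes from the text.

```python
def min_code_length(total_chars):
    """Smallest L such that 2**L >= total_chars."""
    return max(1, (total_chars - 1).bit_length())


def is_ambiguous(L, total_chars):
    """True when a code of length L cannot give every character its own word."""
    return 2 ** L < total_chars


# 26 upper case + 26 lower case + an assumed 20 numbers and symbols.
T = 26 + 26 + 20

print(min_code_length(T))   # shortest unambiguous fixed length
print(is_ambiguous(6, T))   # 64 code words for 72 characters
print(is_ambiguous(7, T))   # 128 code words for 72 characters
```

Under these assumed counts a 6-bit code falls short by eight characters, so either L grows to 7 or some code words must be shared between alphabets.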
As previously mentioned, fixed to variable length encoding becomes attractive where the variable length code word representation is a compressed version of the fixed length representation. Compression is achieved by exploiting certain statistical regularities connected with the source alphabet. The most often used technique is to order the characters by relative frequency of occurrence and to assign the shortest code words to the most frequently occurring characters. This can lead to the rarest characters having very long code words. The upper limit of code word length is that of the register size. To avoid the necessity of long registers, those infrequent characters whose variable length code words would require more than a fixed register's length are transmitted as a specific variable length prefix followed by the character in the clear, i.e., not encoded. This means that the encoder output consists of frequent variable length code word sequences and infrequent fixed length words with the special prefix.
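The length-limited assignment described above can be sketched as follows. This is one possible construction and not the method of any cited reference: it uses a Huffman code and moves the rarest characters behind an escape prefix until every code word fits the register. The `ESC` pseudo-symbol, the frequency table, the 3-bit register limit and the 8-bit clear representation are all assumptions for illustration.

```python
import heapq
from itertools import count

ESC = "<ESC>"  # hypothetical pseudo-symbol meaning "a fixed length word follows"


def huffman(freqs):
    """Return {symbol: bit string} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(w, next(tie), {s: ""}) for s, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, next(tie), merged))
    return heap[0][2]


def build_code(freqs, max_len):
    """Move the rarest characters behind ESC until every code word fits max_len (max_len >= 1)."""
    coded = dict(freqs)
    esc_weight = 0
    while True:
        table = dict(coded)
        if esc_weight:
            table[ESC] = esc_weight
        code = huffman(table)
        if max(len(c) for c in code.values()) <= max_len:
            return code
        rarest = min(coded, key=coded.get)
        esc_weight += coded.pop(rarest)


def encode(text, code, fixed_bits=8):
    """Emit code words; characters without one go out as ESC prefix + clear bits.

    Raises KeyError if a character has no code word and the table has no ESC entry.
    """
    out = []
    for ch in text:
        if ch in code:
            out.append(code[ch])
        else:
            out.append(code[ESC] + format(ord(ch), f"0{fixed_bits}b"))
    return "".join(out)


# Assumed relative frequencies, for illustration only.
freqs = {"e": 40, "t": 30, "a": 15, "o": 8, "q": 4, "z": 2, "x": 1}
code = build_code(freqs, max_len=3)
print(sorted(code.items(), key=lambda kv: len(kv[1])))
print(encode("eat", code), encode("x", code))
```

With these assumed frequencies only the three most frequent characters keep their own code words; the remaining four are sent in the clear behind the escape prefix, which bounds every code word at the assumed register length.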
The encoder output can be viewed as a serial, comma-free bit stream insofar as the variable length words are concerned. Placing commas, i.e., separators, between the variable length words would sharply reduce the compression advantage.
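A comma-free stream can still be parsed when no code word is a prefix of another, since the end of each word is then recognizable from the bits themselves. The following sketch assumes such a prefix-free table; the table `CODE`, the `ESC` pseudo-symbol and the 8-bit clear representation are hypothetical and chosen only for illustration.

```python
ESC = "<ESC>"  # hypothetical pseudo-symbol meaning "a fixed length word follows"

# Hypothetical prefix-free code table: no code word is a prefix of another.
CODE = {"e": "0", "t": "10", "a": "110", ESC: "111"}


def parse(bits, code=CODE, fixed_bits=8):
    """Split a comma-free bit string into characters using a prefix-free table."""
    decode = {w: s for s, w in code.items()}
    out, word, i = [], "", 0
    while i < len(bits):
        word += bits[i]
        i += 1
        if word not in decode:
            continue  # not yet a complete code word; read another bit
        sym = decode[word]
        if sym == ESC:
            # The escape prefix is followed by one character in the clear.
            chunk = bits[i:i + fixed_bits]
            if len(chunk) < fixed_bits:
                raise ValueError("bit stream ended inside a fixed length word")
            out.append(chr(int(chunk, 2)))
            i += fixed_bits
        else:
            out.append(sym)
        word = ""
    if word:
        raise ValueError("bit stream ended inside a code word")
    return "".join(out)


stream = "0" + "110" + "10" + "111" + format(ord("x"), "08b")
print(parse(stream))
```

The parser needs no separators, but it does rely on a single table being in force for the whole stream; it makes no attempt to resolve the alphabet ambiguity discussed earlier.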
The prior art is replete with examples of fixed and variable length encoders. For example, Blasbalg, U.S. Pat. No. 3,237,170, describes an adaptive compactor that in effect varies the variable length code word assignments as the statistics of the relative frequency of occurrence of the source alphabet change. Wernikoff, U.S. Pat. No. 3,394,352, applies each fixed length word in parallel to differently structured encoders. The shortest code word from among the plurality of encoder outputs is used, together with a tag to permit decoding.
Although Blasbalg changes his output codewords, he still preserves a unique one-to-one relationship between each input and output character. The same can be said of Wernikoff. In the latter case, the output tag identifies the encoder/decoder to be used. In contrast, the problem addressed by this invention is that of resolving ambiguous terms, first at the encoder and then at the decoder.