In distributed linked file systems like the World-Wide Web on the Internet, there is frequently a need to store large amounts of information written in natural languages (such as English or German) as plain text in server systems and then to transmit that text efficiently to other server or client systems. Additionally, there is a requirement to perform full-text searches quickly and efficiently on all or part of the stored material, in either client or server computers. These requirements exist not only in hypertext systems like the World-Wide Web, but also in distributed information query and retrieval systems and in database systems that accommodate storage of long text streams. Present methods of data compression that operate uniformly on all binary stored information are not necessarily well suited to these long text streams in terms of either compression or decompression efficiency.
There are a number of conventional compression schemes, for example the compression scheme disclosed in U.S. Pat. No. 5,099,426 to Carlgren et al., hereby incorporated by reference herein. While conventional systems such as that disclosed in U.S. Pat. No. 5,099,426 use word tokenization schemes for compression, they suffer from several inefficiencies that make them less suitable for use in distributed systems. In conventional systems, the tokens (word numbers) assigned to each unique word in the text are determined by processing the specific text to be encoded and developing a table that ranks the words by frequency of occurrence in that text. This document-specific ranking is then used to assign the shortest tokens (typically 1 byte) to the words having the highest frequency of occurrence and to assign longer tokens to the less frequently occurring words.
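By way of illustration only, the frequency-ranked token assignment described above can be sketched as follows. This is a minimal sketch and not the method of U.S. Pat. No. 5,099,426: the function names, the whitespace-based word splitting, and the 128-token threshold separating 1-byte from 2-byte tokens are all assumptions introduced for the example.

```python
from collections import Counter


def build_vocabulary(text):
    """Rank the unique words of ONE document by frequency.

    Returns a dict mapping each word to its token (rank); the most
    frequent word receives token 0. Splitting on whitespace is an
    assumption made for this sketch.
    """
    counts = Counter(text.split())
    # Descending count; ties broken alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked)}


def token_length(token):
    """Hypothetical length rule: 1 byte for tokens 0-127, 2 bytes otherwise."""
    return 1 if token < 128 else 2


def encode(text, vocabulary):
    """Replace each word with its document-specific token."""
    return [vocabulary[w] for w in text.split()]


def decode(tokens, vocabulary):
    """Recover the text; this requires the same vocabulary used to encode."""
    inverse = {token: word for word, token in vocabulary.items()}
    return " ".join(inverse[t] for t in tokens)


text = "the cat saw the dog and the dog saw the cat"
vocabulary = build_vocabulary(text)
tokens = encode(text, vocabulary)
```

In this sketch the most frequent word ("the") receives the smallest token, and decoding is possible only with the vocabulary built from the same document.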
While conventional encoding achieves a high degree of compression, it creates several other inefficiencies, particularly in a distributed hypertext system like the World-Wide Web. First, each document has its own unique encoding for each word. Thus, in one document the word "house" might be assigned a numeric value of 103, and in another document the word "house" might be assigned the number 31464. This document-specific tokenization means that a unique table or vocabulary that maps tokens to words must be maintained as part of each document. Second, the vocabulary table must be stored with the compressed text and must be transmitted with the compressed text to any processor (client or server) that will further store, search, or decompress the document. Third, when such a frequency table is used as the primary mechanism for determining the encoding of tokens in the compressed text, the assignment of tokens to words is so tightly optimized to the frequency distribution of words in the particular encoded document that, when the existing text needs to be updated by even a few words or phrases, the entire encoding scheme must be redone to accommodate any new strings that may be present. Fourth, in order to encode strings of characters that do not constitute natural language words, the strings are assigned their own unique tokens. Examples of such character strings are numeric values, codes, table framing characters, and other character-based diagrams. While conventional compression methods may be acceptable when documents contain only a small number of such strings, the encoding scheme can break down if the document requires representation of larger numbers of such strings. Examples of documents that might be difficult to encode are those that contain scientific or financial tables that have many unique numbers. Fifth, the close optimization of token assignment to word frequency may be complicated in documents that contain large numbers of unique words. Examples of these kinds of documents include dictionaries, thesauri, and technical material containing tables of chemical, drug, or astronomical names. Lastly, conventional compression techniques do not easily accommodate documents that include text from more than one national natural language, such as a translated document that includes both U.S. English and International French.
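The first and third inefficiencies can be illustrated with a short sketch. It assumes a simple frequency-ranked token assignment with whitespace word splitting; the function names and sample documents are hypothetical and chosen only to make the effect visible.

```python
from collections import Counter


def rank_tokens(text):
    """Assign token 0 to the most frequent word of this document, 1 to the next, and so on."""
    counts = Counter(text.split())
    # Ties are broken alphabetically so the assignment is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked)}


def changed_words(old_vocab, new_vocab):
    """Words present in both vocabularies whose token differs."""
    return sorted(w for w in old_vocab
                  if w in new_vocab and old_vocab[w] != new_vocab[w])


# The same word receives different tokens in different documents.
doc_a = "house house house garden"
doc_b = "the garden by the old house"
tokens_a = rank_tokens(doc_a)
tokens_b = rank_tokens(doc_b)

# A small update to doc_a shifts the frequency ranking, so tokens that were
# already assigned change and the previously encoded text must be re-encoded.
doc_a_updated = doc_a + " garden garden garden"
tokens_a_updated = rank_tokens(doc_a_updated)
reassigned = changed_words(tokens_a, tokens_a_updated)
```

Here "house" is token 0 in the first document and token 3 in the second, so neither document can be searched or decoded without its own table. After the update, both "house" and "garden" receive new tokens, which invalidates the earlier encoding of the first document.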