1. Field of the Invention
The present invention relates in general to word processing and other systems which produce and read text files, and in particular to a system for compressing such text files for compact storage and rapid transmission.
2. Description of Related Art
Although computer hardware improvements have progressively increased the capacity and reduced the cost of data storage media, interest in compressing computer data files has continued. With computers increasingly interlinked to one another via narrow bandwidth channels, it is quicker to transmit a data file from one computer to another when it is compressed. The Internet, with its World Wide Web of computers, has made vast quantities of documents stored on thousands of computers around the world readily available to anyone having a computer, a modem, a phone line, and some inexpensive browser software. However, though documents are readily available through the Internet, they are not always quickly available. Modems and telephone lines have limited bandwidth, and large documents require a fair amount of transmission time.
A great many data compression schemes have been proposed and are in use. Some of these schemes are directed primarily to compressing text files representing documents written in a character-based language such as English. Such text files are usually sequences of 8-bit (one byte) character codes, each successive byte representing a successive character of the document in accordance with a standardized encoding system. Most 8-bit encoding schemes are variations on the ASCII encoding system, which assigns common upper and lower case alphanumeric characters, punctuation marks and control characters to the lower 128 codes. Since an 8-bit encoding system encodes up to 256 characters, the remaining upper 128 codes may be assigned to various special characters such as graphics characters, mathematical symbols, special language characters and the like. While an 8-bit ASCII encoding system is a convenient way for a computer to handle characters when processing text documents, it is not a particularly compact way of representing documents.
"Context sensitive encoding" compression schemes make use of fact that in a given language characters do not appear in random sequence but rather tend to occur more frequently in some groups than others. For example in English the pair "qu" occurs more frequently than the pair "qx". The triplet "ing" occurs more often than the triplet "inx". In a context sensitive encoding system, the character represented by a code value depends on the character(s) preceding it in the text file. This enables characters to be represented with fewer bits. U.S. Pat. No. 4,672,679 issued Jun. 9, 1987 to Freeman describes a typical context sensitive encoding compression system.
"Dictionary" type data compression systems capitalize on the fact that words are often repeated in a document. If we use a dictionary to assign, for example, a 16-bit code to each unique word, then we can represent each word with two bytes instead of representing each character of a word with one byte. Since most words have more than 2 characters, a level of compression can be achieved if both compressing and decompressing software have the same dictionary available. Unfortunately 16-bits may be insufficient to uniquely represent each word that may be encountered in a document, particularly since documents containing spelling errors. Also new words make old dictionaries rapidly obsolete. Thus in systems having fixed dictionaries, words not found in a dictionary cannot be compressed. Some systems using fixed dictionaries also create second "adaptive" dictionaries for representing document words that do not appear in the fixed dictionary. The adaptive dictionary is added to the compressed document so that decompression software can refer to it when it cannot find a word in the fixed dictionary. Typical of this approach are U.S. Pat. No. 5,530,645 issued Jun. 25, 1996 to Chu and U.S. Pat. No. 4,899,148 issued Feb. 6, 1990 to Sato et al. One major disadvantage to fixed and "fixed+adaptive" dictionary systems is that the receiving computer must already store a copy of the fixed dictionary. Such systems do not lend themselves well to open networks such as the Internet where there is no assurance that the client computer receiving the document has the appropriate fixed dictionary. In open network environments it is preferable to transmit "self-extracting" compressed files able to decompress themselves without relying on fixed dictionaries or other information stored by the receiving computer.
"Adaptive dictionary systems" employ only a single dictionary created as the text file is being compressed. An adaptive dictionary is normally much smaller than a fixed dictionary because most documents use a substantially fewer number of unique words than would appear in a fixed dictionary. However, though the text file itself can be substantially compressed, much of the compression advantage is lost when the adaptive dictionary must be stored or transmitted with the compressed text file to provide the information needed for decompression. Also prior art dictionary systems typically do not compress characters such as spaces, punctuation and carriage returns that normally appear between words. Yet these characters typically comprise a significant portion of a document.
There have been efforts to compress spelling dictionaries. U.S. Pat. No. 4,747,053 issued May 24, 1988 to Yoshimura discloses a relatively effective system for compressing a spelling dictionary in which all words of the spelling dictionary are arranged in alphabetical order. Each dictionary entry consists of several parts. A first part of a dictionary entry represents the number of leading characters the word has in common with the word of the preceding dictionary entry. A second part of a dictionary entry indicates where the word's suffix, if any, appears in a table of common suffixes. A third part of the entry consists of standard character codes for each character not represented by the first or second parts of the entry. While this system produces a relatively high degree of compression for a spelling dictionary, it provides no further compression for characters occurring between the leading characters and the suffix.
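The three-part entry described above can be sketched as follows. This is a loose illustration based only on the description given here, not the Yoshimura implementation: the suffix table, the tuple layout, and the function names are hypothetical.

```python
# Hypothetical table of common suffixes; index 0 means "no suffix".
SUFFIXES = ["", "ing", "ed", "s"]


def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode_entries(sorted_words):
    """Encode each word as (shared prefix length, suffix index, middle)."""
    entries = []
    prev = ""
    for word in sorted_words:
        shared = common_prefix_len(prev, word)
        rest = word[shared:]
        suffix_index = 0
        # Choose the longest table suffix that ends the remaining characters.
        for i, suffix in enumerate(SUFFIXES):
            if suffix and rest.endswith(suffix) and len(suffix) > len(SUFFIXES[suffix_index]):
                suffix_index = i
        middle = rest[:len(rest) - len(SUFFIXES[suffix_index])]
        entries.append((shared, suffix_index, middle))
        prev = word
    return entries


def decode_entries(entries):
    """Rebuild the word list from the three-part entries."""
    words = []
    prev = ""
    for shared, suffix_index, middle in entries:
        word = prev[:shared] + middle + SUFFIXES[suffix_index]
        words.append(word)
        prev = word
    return words
```

For the alphabetized list "walk", "walked", "walking", "walks", "wall", only the first and last entries carry any literal characters ("walk" and "l"). Those middle characters are stored as ordinary character codes, which is the portion the scheme leaves uncompressed.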
What is needed is a system for rapidly and substantially compressing a text document so that it may be compactly stored, rapidly transmitted and rapidly expanded without need for supplemental information.