1. Field of the Invention
The present invention relates to data and word processing and communications systems and, more particularly, to a method and apparatus for compressing textual information for storage or transmission. In this context, textual information is defined as any information which is represented by a structured order of symbols or characters selected from a defined set, or alphabet, of symbols or characters. Common examples of textual information may include documents such as letters, reports and manuscripts written, for example, in English, German or French, business and accounting records, scientific data, and graphic displays comprised of arrangements of graphic symbols.
2. Prior Art
A recurring problem in data processing and communications systems is that of storing, processing and communicating ever increasing volumes of information. The information handling requirements of such systems increase at least as rapidly as, and often more rapidly than, the capacity of available memories and data links. In addition, there are often physical or economic limits upon the memory or communications capability that may be provided with or added to a particular system. As a result, methods other than increasing memory or data link capacity have been developed to enable systems to handle increased volumes of information. One such method is referred to as data compression, wherein information communicated into a system by a user of the system is transformed by the system into a more compact, or reduced, form for storage or transmission. The information may subsequently be transformed, or decompressed, from its reduced form to its original form to be communicated to the user.
Typically, the language in which information is communicated between a system and a user of the system contains a significant degree of redundancy. That is, the language in which information is expressed contains more information than is required to completely and accurately represent the actual information. A common example occurs in word processing wherein information, that is, text, is communicated between the user and system in the English language, including punctuation and format characters such as periods, commas, spaces, tabs and line returns. Text compression is possible because of such redundancy and essentially transforms a user language text into a more compact form by deleting the redundant information from the user language version of the text.
Text compression methods of the prior art have been based upon distributional redundancy, that is, the nonlinearity in frequency of use or occurrence of certain characters, character combinations, and words in particular user languages. For example, in the English language the characters `e` and `space` occur more frequently than `y` or `z`, and certain letter pairs, or digraphs, such as `th` and `es`, and certain words, such as `the`, `of`, and `and`, occur frequently.
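The distributional redundancy described above can be observed by simple counting. The following is a minimal sketch, not taken from any prior art scheme; the function name `char_and_digraph_counts` and the sample sentence are assumptions made for illustration only.

```python
from collections import Counter


def char_and_digraph_counts(text):
    """Count single characters and adjacent character pairs (digraphs)."""
    chars = Counter(text)
    # Each digraph is the two-character window starting at position i.
    digraphs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return chars, digraphs


# Hypothetical sample text, chosen only to make the skew visible.
sample = "the theory of the thing and the other"
chars, digraphs = char_and_digraph_counts(sample)

# List the most frequent units first, as a frequency analysis would.
print(chars.most_common(3))
print(digraphs.most_common(3))
```

Even in so short a sample, the space character and the digraph `th` occur far more often than most other units, which is the unevenness that such methods exploit.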
Prior art schemes have used this distributional redundancy to achieve compression by assigning variable length code words, or characters, to represent the frequently appearing characters, character combinations and words in particular languages. That is, the most frequently appearing characters, character combinations and words are assigned short code characters. Less common character combinations and words are, depending upon frequency of occurrence, assigned longer code characters or are `spelled out` as sequences of the more frequently occurring characters, character combinations and words.
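One well-known way of assigning shorter code words to more frequent symbols is Huffman coding. The sketch below uses it purely as an illustration of variable length code assignment; the text does not state that any particular prior art scheme used this construction, and the names `build_prefix_code` and `encode` are hypothetical. For brevity it codes single characters only, not character combinations or words.

```python
import heapq
from collections import Counter


def build_prefix_code(text):
    """Return {symbol: bit string}; frequent symbols get shorter strings."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs a one-bit code word.
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique integer tie_breaker keeps the dicts from being compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)  # the two least frequent groups
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text, code):
    """Concatenate the code words for each symbol of the text."""
    return "".join(code[ch] for ch in text)


# Hypothetical input: 'a' is common, 'c' is rare.
text = "aaaaaaaabbbc"
code = build_prefix_code(text)
bits = encode(text, code)
print(code, len(bits))
```

With this input the common symbol `a` receives a one-bit code word and the rarer symbols receive two-bit code words, so the twelve characters are represented in 16 bits.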
The actual compression and decompression of text in data and word processing and communications systems is generally implemented through the use of `look-up` tables relating the frequently occurring characters, character combinations and words to the corresponding assigned code characters. The compression and decompression tables are generated separately from the actual compression/decompression operation and typically require a thorough, detailed linguistic analysis of very large volumes of text in the user language. It should be noted that while it is possible to assign a code character to each possible word and character in a particular language, the resulting code characters and tables become so large as to require more memory space than would be saved by text compression.
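A table-driven scheme of this general kind can be sketched as follows. The table contents, the choice of control characters as code characters, and the greedy longest-match parsing are all assumptions made for illustration; a real table would be derived from the linguistic analysis described above. The sketch also assumes the input text never contains the code characters themselves.

```python
# Hypothetical compression table: frequent units -> single code characters.
COMPRESS_TABLE = {
    "the": "\x01",
    "and": "\x02",
    "of": "\x03",
    "th": "\x04",
    "es": "\x05",
}
# The decompression table is the inverse mapping.
DECOMPRESS_TABLE = {code: unit for unit, code in COMPRESS_TABLE.items()}
MAX_UNIT = max(len(unit) for unit in COMPRESS_TABLE)


def compress(text):
    """Replace table units with code characters, longest match first."""
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_UNIT, 1, -1):
            unit = text[i:i + size]
            if unit in COMPRESS_TABLE:
                out.append(COMPRESS_TABLE[unit])
                i += len(unit)
                break
        else:
            # No table entry matched here; copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def decompress(data):
    """Expand each code character; pass other characters through."""
    return "".join(DECOMPRESS_TABLE.get(ch, ch) for ch in data)


original = "the cat and the hat"
packed = compress(original)
restored = decompress(packed)
print(len(original), len(packed), restored == original)
```

Note that in this sketch `compress` must search the text for matching units of several sizes, while `decompress` is a plain per-character look-up, so the two operations use different tables and different routines.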
Distributional redundancy methods of text compression are very dependent upon the linguistic characteristics of the individual languages in which the original texts are created, particularly with regard to larger linguistic units, such as character combinations and words. For example, English, German, French, Russian, Italian and the Scandinavian languages all have distinctly different linguistic characteristics, require different methods of analysis, and result in very different compression and decompression tables. As such, the compression schemes of the prior art have required a detailed linguistic analysis of very large volumes of text in each separate user language in order to generate compression/decompression tables.
Because of the linguistic dependency of distributional redundancy methods, in particular with regard to the larger linguistic units, it is difficult to develop a completely general purpose method for analyzing distributional redundancy for a broad range of languages. Moreover, and for the same reasons, the compression/decompression tables for a particular language may depend upon the particular `dialect` of text to be operated upon; for example, the linguistic characteristics for business, scientific and literary text may differ sufficiently to require separate tables for each application.
Further, because such methods use linguistic units and code words of differing sizes, compression/decompression requires relatively sophisticated programs with complex parsing capabilities, with corresponding increases in processing capability, processing time and program memory space. For the same reason, the compression and decompression operations may not be symmetric, that is, may require separate tables and the execution of different routines with, again, increased processing and memory requirements. Finally, and for the same reasons, such methods are not suitable for continuous, in-line text processing or communication, as the text must be processed as a series of small `batch` operations, where the sizes of the batches are determined by the sizes of the linguistic units and code words.