In a typical word processing system, each paragraph exists internally as one or more strings of characters and must be broken into lines before it can be displayed or printed. The typical line-breaking algorithm has a main inner loop that adds the width of the current character to the sum of the widths of the previous characters and compares the new total to the desired line width. The program executes this loop until the accumulated width of the characters in the line exceeds the width available for the line. At this point, the program can either end the line with the last full word, or hyphenate the current word and put the portion of the word after the hyphen at the beginning of the next line.
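The character-at-a-time loop described above can be illustrated with a minimal sketch. The names `break_lines` and `char_width` are hypothetical and do not come from any particular word processor; the sketch assumes every character has unit width unless a width function is supplied, and it implements only the "end the line with the last full word" option, not hyphenation.

```python
def break_lines(text, line_width, char_width=lambda ch: 1):
    """Greedy line breaking that examines one character per loop iteration."""
    lines = []
    start = 0          # index of the first character of the current line
    total = 0          # running sum of character widths on the current line
    last_space = None  # index of the most recent space seen on the current line
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == " ":
            last_space = i
        total += char_width(ch)          # the inner-loop addition
        if total > line_width and ch != " ":
            if last_space is not None:
                # End the line with the last full word.
                lines.append(text[start:last_space])
                start = last_space + 1
            else:
                # A single word wider than the line: split it where it overran.
                if i == start:
                    i += 1               # always place at least one character
                lines.append(text[start:i])
                start = i
            # Begin measuring the next line from its first character.
            i = start
            total = 0
            last_space = None
            continue
        i += 1
    if start < len(text):
        lines.append(text[start:])
    return lines


print(break_lines("the quick brown fox jumps", 10))
```

Note that the loop body runs at least once for every character in the paragraph, which is the cost the following paragraph objects to.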
Two problems with this process cause it to run too slowly: first, the inner loop must be executed for every character in the line; second, if hyphenation is enabled, the context of the character that overran the margin must be deduced; that is, a determination must be made whether the character is a space, a punctuation mark, or part of a word. In general, all operations that require processing of each character, such as pagination and scrolling through the document, are very slow. In addition, operations that depend on the interpretation of the document as a sequence of words, such as hyphenation, spell-checking, and search and replace, are also very slow.
U.S. Pat. No. 4,181,972 (Casey) relates to a means and methods for automatic hyphenation of words and discloses a means responsive to the length of input words, rather than characters. However, the Casey patent does not store the word length obtained for future use; at the time that hyphenation is requested, the Casey method scans the entire text character by character. The Casey patent also does not compute breakpoints based on the whole word length. Instead, Casey teaches the use of a memory-based table of valid breakpoints between consonant/vowel combinations.
U.S. Pat. Nos. 4,092,729 (Rosenbaum et al.) and 4,028,677 (Rosenbaum) relate to methods of hyphenation also based on a memory table of breakpoints. Rosenbaum '729 accomplishes hyphenation based on word length (see claim 6), but the method disclosed differs from the invention disclosed here. In Rosenbaum '729, words are assembled from characters at the time hyphenation is requested, and then compared to a dictionary containing words with breakpoints. The invention disclosed here assembles the words at the time the document is encoded, and does not use a dictionary look-up technique while line breaks are computed.
What is required is a better method of representing the text for document processing. A natural approach for reducing the computational intensity of the composition function would be to create data structures that would enable computation a word at a time rather than a character at a time. The internal representation of the text, in this case, is a token which is defined as the pair: <type, data>
where the type is a unique identifier for each class of token, and data is the data associated with a particular type of token. A token can be represented in a more compact way as <type, pointer>
where the pointer is the address of the data associated with that token. This form of the token is more easily manipulated since all entries are the same length. An even more compact representation of a token is achieved when the token type is included in the data block; this reduces the fundamental token object to a pointer. Since the type information is still present in the data block, a pointer of this form is still appropriately referred to as a token. In the past, several approaches used an internal representation of text that was some form of token, and all had drawbacks that prevented them from being applied to rapid text composition.
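The most compact token form described above, a bare pointer to a data block that carries its own type, can be sketched as follows. This is an illustration under stated assumptions rather than the encoding of any system discussed here: a list index stands in for the pointer, the three token classes (`WORD`, `SPACE`, `PUNCT`) and the names `DataBlock`, `encode`, and `decode` are hypothetical, and the regular expression is one arbitrary way of splitting text into tokens.

```python
import re
from dataclasses import dataclass
from enum import Enum


class TokenType(Enum):
    WORD = "word"
    SPACE = "space"
    PUNCT = "punct"


@dataclass(frozen=True)
class DataBlock:
    type: TokenType   # the type lives in the data block, not in the token
    text: str         # the data associated with the token


# Every character is a word character, whitespace, or something else,
# so the three alternatives together cover the whole input.
_PATTERN = re.compile(r"(\w+)|(\s+)|([^\w\s])")


def encode(text):
    """Return (tokens, blocks); each token is an index into blocks."""
    blocks = []   # table of data blocks; the index plays the role of a pointer
    index = {}    # DataBlock -> index, so repeated words share one block
    tokens = []
    for m in _PATTERN.finditer(text):
        if m.group(1) is not None:
            kind = TokenType.WORD
        elif m.group(2) is not None:
            kind = TokenType.SPACE
        else:
            kind = TokenType.PUNCT
        block = DataBlock(kind, m.group(0))
        if block not in index:
            index[block] = len(blocks)
            blocks.append(block)
        tokens.append(index[block])
    return tokens, blocks


def decode(tokens, blocks):
    """Rebuild the original text by following each pointer."""
    return "".join(blocks[t].text for t in tokens)


tokens, blocks = encode("the cat, the dog")
print(tokens)
print(decode(tokens, blocks))
```

Because every token is a single fixed-size index, the token sequence can be scanned and manipulated uniformly, while the type of any token is still recoverable as `blocks[t].type`.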
Numerous prior systems have used tokens for editing computer programs. See, for example: Copilot: A Multiple Process Approach to Interactive Programming Systems, Daniel Charles Swinehart, July 1974, Ph.D. thesis, Stanford University. Swinehart uses tokens to maintain a relationship between the source code (text) and the corresponding parse tree that the compiler uses to translate the program into machine instructions. After each editing operation, the lines of source code that changed are rescanned into tokens, the parse tree is rebuilt, and finally the parse tree is inspected for correctness. These systems are very popular for creating and modifying programs written in languages like Lisp, but tend to be fairly slow and laborious. The benefit to the user is that there is a greater likelihood that the changes made to a program will result in errors being removed rather than introduced.
A second known approach uses tokens as the fundamental text unit to represent English words rather than elements of a computer programming language. In Lexicontext: A Dictionary-Based Text Processing System, John Francis Haverty, August 1971, master's thesis, Massachusetts Institute of Technology, a token points to a lexicon entry containing the text for the word; a hashing function is then used to retrieve the data associated with the entry, which can be uniquely defined for each token. This encoding method is very general, but its generality comes at the expense of performance.
Furthermore, since a principal application of Haverty's method is as a natural language interface to an operating system, the lexicon is global and thus independent of any particular document. This architecture is practical in an environment where the information is processed on a single central processor and when the entire universe of words that would be encountered is known in advance. Even if words could be added to the global lexicon, there would still be problems in a distributed environment where processors may not be connected to a network or other communications device. In this case, the lexicons would quickly diverge, and documents created on one machine could not be correctly interpreted on any other machine. Another major drawback of this approach is that if an error is detected in the main lexicon, all of the documents encoded with the flawed lexicon would need to be reprocessed, if it were even possible to rebuild the documents. Because the main lexicon must by design be very large, it would be impractical to maintain the lexicon as resident in main memory. A large lexicon not resident in main memory would impose a tremendous performance penalty.
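The divergence problem described above can be made concrete with a small sketch. It is not a model of Haverty's system: the `Lexicon` class, its methods, and the word lists are hypothetical, and tokens are simply positions in a word list. The sketch assumes two disconnected machines start from the same lexicon and each adds new words locally.

```python
class Lexicon:
    """A growable word list; a token is the position of a word in the list."""

    def __init__(self, words=()):
        self.words = list(words)
        self.index = {w: i for i, w in enumerate(self.words)}

    def encode(self, text):
        tokens = []
        for w in text.split():
            if w not in self.index:
                # Unknown word: add it locally and assign the next free token.
                self.index[w] = len(self.words)
                self.words.append(w)
            tokens.append(self.index[w])
        return tokens

    def decode(self, tokens):
        return " ".join(self.words[t] for t in tokens)


shared = ["the", "cat", "sat"]
machine_a = Lexicon(shared)
machine_b = Lexicon(shared)

# Each machine, with no connection to the other, learns a different new word.
machine_a.encode("the zebra")
machine_b.encode("the walrus")

# A document encoded on machine A is misread on machine B.
document = machine_a.encode("the zebra sat")
print(machine_a.decode(document))
print(machine_b.decode(document))
```

Both machines assigned the same token value to different words, so the token sequence alone no longer identifies the text; a lexicon stored with the document would not have this problem.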