1. Field of the Invention
Preferred embodiments provide a method, system, and program for generating a table for use in determining boundaries between characters and, in particular, for generating the table from a state machine table.
2. Description of the Related Art
Computer text editors display words on a page such that characters within each word remain together. Words are typically separated by whitespace or punctuation, such as a period, comma, semi-colon, etc. During operation, a word processor (text editor) may have to determine morphological boundaries in text, such as the boundaries between characters, words, sentences or paragraphs. For instance, during the operation of a spell check program, the word processor must go from the beginning to the end of the document to locate each word on which to perform a spell check operation. One program used to locate word, sentence or character boundaries in text is the International Business Machines Corporation ("IBM") BreakIterator. The BreakIterator program is a class in the Java Class Libraries, which are part of both the Java Development Kit (JDK), which comprises programming tools for developers to create Java programs, and the Java Runtime Environment (JRE), which is the application used to execute Java programs. BreakIterator uses a state machine table to process each character to determine whether a morphological boundary has been reached.
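By way of illustration only, the forward scan described above can be sketched with the public java.text.BreakIterator class. The class name WordBoundaryDemo, the method name words, and the rule of keeping only segments that begin with a letter or digit are assumptions made for this sketch and are not taken from any particular spell check program.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundaryDemo {
    /**
     * Walks the text from beginning to end and collects each segment that
     * lies between two consecutive word boundaries and starts with a letter
     * or digit (whitespace and punctuation segments are skipped).
     */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (Character.isLetterOrDigit(text.charAt(start))) {
                out.add(text.substring(start, end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Each returned word could be handed to a spell check routine.
        System.out.println(words("Check the spelling, please."));
    }
}
```

Each segment returned by words could then be passed to a spell check operation.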
A state machine provides an output based on one or more received inputs. A state machine in effect memorizes previous inputs to determine the current output. A non-deterministic state machine can indicate multiple output states for a given input, whereas a deterministic state machine indicates only one output state for a given input. The behavior of a deterministic machine can be defined in a state transition diagram, such as that shown in FIG. 1, which illustrates an abstraction of the type of state machine BreakIterator uses to determine morphological boundaries based on a previous character and a next transition character.
FIG. 1 illustrates a state transition diagram, which shows all possible states as circles. The circles are connected by arrows representing possible state transitions. The arrows are labeled by the input values that cause the particular transition, e.g., the arrow from state 2 to state 4 indicates that the input is a digit. A double circle represents an accepting state. If the current state is an accepting state, and the next character in the text does not indicate a transition along any of the transition lines, then a word boundary is placed after the accepting state position. A single circle indicates a non-accepting state. If the current state is one of the non-accepting states, and the character in the next position does not provide a transition to an accepting state, then the end of word boundary is placed at the position following the previous accepting state from which the transition to the current non-accepting state occurred. For instance, if the current state is a letter (2), then receiving another letter will cause a transition (letter) back to the letter state (2), receiving a digit will cause a transition (digit) to the digit (4) state, or receiving a word punctuation will cause a transition (wordPunct) to the word punctuation state (3). Word punctuation refers to punctuation marks that are acceptable for use within words, such as hyphens and apostrophes. Digit punctuation refers to punctuation marks acceptable within numbers, such as a decimal point, comma, etc. If the current state is a digit (4) and a number suffix is the input character, then the transition (numSuffix) will lead to the number suffix state (6). Because there is no transition possible out of the number suffix state (6), a word boundary is placed thereafter. Alternatively, at the punctuation states (3 and 5) reached from the letter state (2) or the number state (4), there is no transition if the next character is further punctuation.
This means that at the punctuation non-accepting states (3 and 5), if the next character is punctuation, then a word boundary will be placed at the previous accepting state, which is the previous letter (2) or number (4) state, respectively, from which the non-accepting punctuation state (3 or 5) was reached. After placing a word boundary, control proceeds to the start state (1) to process the next characters in the text to determine a next word boundary.
FIG. 2 illustrates a table representing the state machine in FIG. 1 that the text editor program uses to determine word boundaries. The shaded rows indicate accepting states. A row indicates a current state and a column indicates an input character. The value in a cell indicates the next row (or state) to transition to if the next input character is the character for that cell's column. For instance, the values in row 1 apply at the start state. The cell value in row 1, column 1 indicates reading a letter character following the start position, which causes a transition to row 2, which represents the letter state 2. At the letter state, which is indicated as row 2 in the table, receiving an apostrophe or other word punctuation causes the transition that leads to the punctuation state, which is represented by row 3 in the table in FIG. 2. If a letter is received as input in the punctuation state, then the transition proceeds back to the letter state 2, which is the value in the first column (the letter column) in row 3. Anything other than a letter at the punctuation state, shown as the other columns in row 3 following the letter ("ltr") column, indicates no transition, which causes the insertion of a word boundary. Thus, at a state i, the next state is determined by the value in row i at the column corresponding to the character type at the next position in the input text. If the cell corresponding to row i and the column for the character type in the next position contains a number, then the next state is provided at the row corresponding to that number. If the cell at row i and the column corresponding to the next character is empty, then the word boundary has been reached.
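The table lookup described above can be sketched as follows. This is a minimal illustration and not the BreakIterator implementation: FIG. 2 is not reproduced here, so the cell values below are assumptions reconstructed from the transitions named in the text, the value 0 stands in for an empty cell, and the character categories (for example, treating only '%' as a number suffix) are hypothetical simplifications. The names ForwardWordScanner, NEXT, ACCEPTING, category, and nextBoundary are likewise invented for the sketch.

```java
public class ForwardWordScanner {
    // Column indices, one per character category (hypothetical categories).
    static final int LTR = 0, DGT = 1, WORD_PUNCT = 2, DGT_PUNCT = 3, NUM_SUFFIX = 4, OTHER = 5;

    // NEXT[state][category] gives the next state; 0 stands for an empty cell.
    // Cell values are assumed from the transitions described in the text.
    static final int[][] NEXT = {
        // ltr dgt wPn dPn sfx oth
        {0, 0, 0, 0, 0, 0}, // row 0 unused so that rows match state numbers
        {2, 4, 0, 0, 0, 0}, // 1: start
        {2, 4, 3, 0, 0, 0}, // 2: letter (accepting)
        {2, 0, 0, 0, 0, 0}, // 3: word punctuation
        {0, 4, 0, 5, 6, 0}, // 4: digit (accepting)
        {0, 4, 0, 0, 0, 0}, // 5: digit punctuation
        {0, 0, 0, 0, 0, 0}, // 6: number suffix (accepting)
    };

    static final boolean[] ACCEPTING = {false, false, true, false, true, false, true};

    /** Maps a character to a table column (simplified for illustration). */
    static int category(char c) {
        if (Character.isLetter(c)) return LTR;
        if (Character.isDigit(c)) return DGT;
        if (c == '\'' || c == '-') return WORD_PUNCT;
        if (c == '.' || c == ',') return DGT_PUNCT;
        if (c == '%') return NUM_SUFFIX;
        return OTHER;
    }

    /**
     * Starting in state 1 at index start, follows the table until an empty
     * cell is reached, then returns the position just after the last
     * accepting state. If no accepting state was reached, the single
     * character at start is treated as its own segment.
     */
    static int nextBoundary(String text, int start) {
        if (start >= text.length()) return text.length();
        int state = 1;
        int lastAccept = -1;
        for (int pos = start; pos < text.length(); pos++) {
            int next = NEXT[state][category(text.charAt(pos))];
            if (next == 0) break;            // empty cell: no transition
            state = next;
            if (ACCEPTING[state]) lastAccept = pos + 1;
        }
        return lastAccept >= 0 ? lastAccept : start + 1;
    }

    public static void main(String[] args) {
        System.out.println(nextBoundary("don't stop", 0));
        System.out.println(nextBoundary("cats', dogs", 0));
    }
}
```

In the second call the scan ends in the non-accepting punctuation state 3, so the boundary falls back to the position after the last letter, as described for the non-accepting states.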
The state machine shown in FIGS. 1 and 2 is used to proceed forward to locate word boundaries. However, the program may need to determine the nearest word boundaries before and after an arbitrary, randomly accessed location in the text, such as a cursor location in the middle of a word. In such a case, the program needs to move backwards in the sentence to locate a position that is unambiguously a boundary position. For instance, the program could proceed to the beginning of the document, and then use the state machine shown in FIGS. 1 and 2 to proceed forward to an end of word boundary following the randomly accessed position, i.e., the current cursor position. In this way, the program would know that the word boundary following the randomly accessed position is the end of the word including the random position and that the determined word boundary immediately preceding the random position is the boundary for the beginning of the word. The program could then highlight the characters between the determined beginning and end of word boundaries.
Word processing programs typically do not back up to the beginning of the document to start determining word boundaries because the cursor or other randomly accessed position could be located far below the top of the document, thereby requiring numerous unnecessary word boundary detections well before the cursor position. The prior art BreakIterator class backs up to a position that is an unambiguous word break. This process, which determines a word boundary prior to any randomly accessed point in a document, is referred to as random access iteration. From this backed-up position, the program then uses the logic of FIGS. 1 and 2 to move forward to determine the word boundary following the cursor or randomly accessed position.
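From the caller's point of view, random access iteration can be sketched with the public isBoundary, preceding, and following methods of java.text.BreakIterator. The class name WordAtOffset and the method name wordAt are assumptions made for this sketch, and the sketch shows only the public API, not the backward state table used internally.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class WordAtOffset {
    /**
     * Returns the segment of text that contains the given offset, bounded by
     * the nearest word boundary at or before the offset and the first word
     * boundary after it. The offset must lie between 0 and text.length().
     */
    static String wordAt(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        // If the offset already sits on a boundary, the segment starts there;
        // otherwise back up to the nearest boundary before the offset.
        int start = it.isBoundary(offset) ? offset : it.preceding(offset);
        int end = it.following(offset);
        if (start == BreakIterator.DONE) start = 0;
        if (end == BreakIterator.DONE) end = text.length();
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Offset 6 falls inside the second word of the sample text.
        System.out.println(wordAt("The quick brown fox", 6));
    }
}
```

A text editor could use the two returned boundaries to highlight the word under the cursor without scanning from the top of the document.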
In the prior art IBM BreakIterator class, a backward state table is created to determine unambiguous word boundaries between characters. The state table used with BreakIterator is filled in directly by the programmers on an ad hoc basis. The process of generating the backwards state table by having programmers directly encode the table is substantially labor intensive and time consuming.
Thus, there is a need in the art to provide an improved method, system, and program for determining where in a document to back up to in order to be at a position that is unambiguously before the word break prior to a randomly accessed position within the document.
To overcome the limitations in the prior art described above, preferred embodiments disclose a method, system, and program for generating a table for use by a computer in determining a location of a boundary between two characters. A first table indicates a boundary between characters when processing text in a first direction. A second table is generated based on the content of the first table. The second table can be used to determine whether one boundary is located between any two consecutive characters processed in a second direction.
In further embodiments, the first table comprises rows and columns defining a state machine table indicating a next state based on a current state and an input character. Further, the second table indicates whether one boundary should be placed between any two consecutive characters processed in the second direction.
In still further embodiments, the second table is a paired comparison table implemented as a state machine.
Preferred embodiments provide an algorithm for automatically generating a table for determining unambiguous boundaries when moving in one direction in text from a state machine table for determining all boundaries in the opposite direction. With the preferred embodiments, programmers do not have to manually modify the table indicating whether boundaries should be inserted for use in processing characters in the text in a backward direction because such table, and other related tables, are generated from the state machine table for processing characters in the forward direction.