1. Field of the Invention
Preferred embodiments provide a method, system, and program for generating a deterministic table to determine boundaries between characters.
2. Description of the Related Art
Computer text editors display words on a page such that characters within each word remain together. Words are typically separated by whitespace or punctuation, such as a period, comma, semi-colon, etc. During operation, a word processor may have to determine morphological boundaries in text, such as characters, words, sentences or paragraphs. For instance, when displaying strings comprising unseparated words on a line, the word processor may have to determine where to break the string between the unseparated words. Similarly, during operation of a spell check program, the word processor must go from the beginning to the end of the document to locate each word on which to perform a spell check operation. One program used to locate word, sentence or character boundaries in text is the International Business Machines Corporation ("IBM") BreakIterator. The BreakIterator program is a class in the Java Class Libraries, which are part of both the Java Developer Kit (JDK), which comprises programming tools for developers to create Java programs, and the Java Runtime Environment (JRE), which is the environment used to execute Java programs.** BreakIterator uses a state machine table to process each character to determine whether a morphological boundary has been reached.
** Java is a trademark of Sun Microsystems Inc. 
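As a brief illustration of the kind of boundary lookup described above, the following sketch uses the public `java.text.BreakIterator` API to list word boundary offsets. The class name `WordBoundaryDemo` and the sample text are hypothetical, and the sketch shows only the public interface, not the internal state machine table.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; only the BreakIterator calls are part of the Java API.
class WordBoundaryDemo {
    // Collects every boundary offset the word iterator reports for the text.
    static List<Integer> boundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Hello, world";
        List<Integer> b = boundaries(text);
        // Print the text between each pair of adjacent boundaries.
        for (int i = 0; i + 1 < b.size(); i++) {
            System.out.println("[" + text.substring(b.get(i), b.get(i + 1)) + "]");
        }
    }
}
```

A spell check pass of the kind mentioned above could walk these offsets and hand each segment that begins with a letter to the checker.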
A state machine provides an output based on one or more received inputs. A state machine in effect memorizes previous inputs to determine the current output. A non-deterministic state machine can indicate multiple output states for a given input, whereas a deterministic state machine indicates only one output state for a given input. The behavior of a deterministic machine can be defined in a state transition diagram, such as that shown in FIG. 1, which illustrates an example of the type of state machine BreakIterator uses.
FIG. 1 illustrates a state transition diagram, which shows all possible states as circles. The circles are connected by arrows representing possible state transitions. The arrows are labeled by the input values that cause the particular transition, e.g., the arrow from state 2 to state 4 indicates that the input is a digit. A double circle represents an accepting state. If the current state is an accepting state, and the next character in the text does not indicate a transition along any of the transition lines, then a word boundary is placed after the accepting state position. A single circle indicates a non-accepting state. If the current state is one of the non-accepting states, and the character in the next position does not provide a transition to an accepting state, then the end of word boundary is placed at the position following the previous accepting state from which the transition to the current non-accepting state occurred. For instance, if the current state is a letter (2), then receiving another letter will cause a transition (letter) back to the letter state (2), receiving a digit will cause a transition (digit) to the digit state (4), and receiving a word punctuation will cause a transition (wordPunct) to the word punctuation state (3). Word punctuation refers to punctuation marks that are acceptable for use within words, such as hyphens and apostrophes. Digit punctuation refers to punctuation marks acceptable within numbers, such as a decimal point, comma, etc. If the current state is a digit (4) and a number suffix is the next input character, then the transition (numSuffix) will lead to the number suffix state (6). Because there is no transition possible out of the number suffix state (6), a word boundary is placed thereafter. Alternatively, at the punctuation states (3 and 5) reached from the letter state (2) or the number state (4), there is no transition if the next character is further punctuation.
This means that at the punctuation non-accepting states (3 and 5), if the next character is punctuation, then a word boundary will be placed at the previous accepting state, which is the previous letter (2) or number (4) state, respectively, from which the non-accepting punctuation state (3 or 5) was reached. After placing a word boundary, control proceeds to the start state (1) to process the next characters in the text to determine a next word boundary.
FIG. 2 illustrates a representation of the state machine in FIG. 1 as a two dimensional array that the text editor program uses to determine word boundaries. The shaded rows indicate accepting states. A row indicates a current state and a column indicates an input received at that state. The circles representing states in FIG. 1 are labeled with numbers indicating the corresponding row representing that state in the table in FIG. 2. The value in a cell indicates the next row or state based on an input of the column value. For instance, row 1 represents the start state. The cell value in row 1, column 1 indicates that a letter character following the start position causes a transition to row 2, which represents the letter state. At the letter state, which is indicated as row 2 in the table, receiving an apostrophe or other word punctuation causes a transition to the punctuation state, which is represented by row 3 in the table in FIG. 2. If a letter is received as input in the punctuation state, then a transition occurs back to the letter state, row 2, which is the value in the first column (the letter column) in row 3. Anything other than a letter at the punctuation state, shown as the other columns in row 3 following the letter ("ltr") column, indicates no transition, which causes the insertion of a word boundary. Thus, at a state i, the next state is determined by the value in row i at the column corresponding to the character type at the next position. If the cell corresponding to row i and the column for the character type in the next position contains a number, then the next state is provided at the row corresponding to the number. If the cell at row i and the column corresponding to the next character is empty, then the word boundary has been reached.
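The table lookup described above can be sketched as follows. FIG. 2 is not reproduced here, so the table contents below are an assumption reconstructed from the prose description of FIG. 1. The class name `WordTableScanner`, the column order, the character classes (for example, treating '%' as a number suffix), and the single-character fallback when no accepting state is reached are likewise hypothetical.

```java
// Hypothetical sketch of a table-driven word boundary scan.
class WordTableScanner {
    // Column indexes (assumed order; FIG. 2 may order them differently).
    static final int LETTER = 0, WORD_PUNCT = 1, DIGIT = 2,
                     DIGIT_PUNCT = 3, NUM_SUFFIX = 4, OTHER = 5;

    // TABLE[state][column] = next state; 0 stands for an empty cell.
    static final int[][] TABLE = {
        {0, 0, 0, 0, 0, 0}, // row 0 unused so that 0 can mean "no transition"
        {2, 0, 4, 0, 0, 0}, // 1: start
        {2, 3, 4, 0, 0, 0}, // 2: letter (accepting)
        {2, 0, 0, 0, 0, 0}, // 3: word punctuation (non-accepting)
        {0, 0, 4, 5, 6, 0}, // 4: digit (accepting)
        {0, 0, 4, 0, 0, 0}, // 5: digit punctuation (non-accepting)
        {0, 0, 0, 0, 0, 0}, // 6: number suffix (accepting)
    };
    static final boolean[] ACCEPTING = {false, false, true, false, true, false, true};

    // Maps a character to a column; these classes are illustrative assumptions.
    static int classify(char c) {
        if (Character.isLetter(c)) return LETTER;
        if (Character.isDigit(c)) return DIGIT;
        if (c == '\'' || c == '-') return WORD_PUNCT;
        if (c == '.' || c == ',') return DIGIT_PUNCT;
        if (c == '%') return NUM_SUFFIX;
        return OTHER;
    }

    // Returns the offset of the boundary that follows the position start.
    static int nextBoundary(String text, int start) {
        int state = 1;
        int lastAccept = -1;
        int pos = start;
        while (pos < text.length()) {
            int next = TABLE[state][classify(text.charAt(pos))];
            if (next == 0) break;               // empty cell: stop scanning
            state = next;
            pos++;
            if (ACCEPTING[state]) lastAccept = pos;
        }
        // Back off to the last accepting position; if none was reached,
        // treat the single character as its own unit (an assumption).
        return lastAccept >= 0 ? lastAccept : Math.min(start + 1, text.length());
    }

    public static void main(String[] args) {
        String text = "don't pay 3.14% more";
        int pos = 0;
        while (pos < text.length()) {
            int end = nextBoundary(text, pos);
            System.out.println("[" + text.substring(pos, end) + "]");
            pos = end;
        }
    }
}
```

Tracking the last accepting position is what lets the scan back off when it stops in a non-accepting punctuation state, as in "rock-," where the boundary falls after "rock".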
In the prior art IBM BreakIterator product, the BreakIterator programmer must manually create and modify the state machine table shown in FIG. 2. Such manual editing of these tables can be time-consuming and cumbersome. Thus, there is a need in the art to provide an improved system for generating the state machine table.
To overcome the limitations in the prior art described above, preferred embodiments disclose a method, system, and program for generating a data structure for use by a computer in determining a location of boundaries in text. The data structure is initialized and at least one regular expression is processed. Input characters in the at least one regular expression are then processed to determine at least one transition to at least one state. A determination is then made as to whether one input character would cause a transition to multiple states. If so, additional states are added to the data structure to transform the transition to multiple states to a deterministic transition.
In further embodiments, adding additional states comprises adding an additional state having a same number of output transitions as a number of non-deterministic output transitions from the non-deterministic state.
In still further embodiments, data structures are used to indicate states capable of transitioning to multiple states. In such case, each state having transitions to multiple states is updated to point to a new state providing deterministic transitions to the multiple states.
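To make the idea of replacing a transition to multiple states with a deterministic transition more concrete, the sketch below applies the classic textbook subset construction, in which each new state stands for the set of states the input could have led to. This is a swapped-in standard technique offered for illustration, not necessarily the procedure of the preferred embodiments. The class name `SubsetConstruction`, the map-based representation, and the example machine are all hypothetical.

```java
import java.util.*;

// Hypothetical sketch of the textbook subset construction.
class SubsetConstruction {
    // nfa: state -> input character -> set of possible next states.
    // Returns: set-of-states -> input character -> single set-of-states.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> determinize(
            Map<Integer, Map<Character, Set<Integer>>> nfa, int start) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> startSet = new TreeSet<>(List.of(start));
        dfa.put(startSet, new TreeMap<>());
        work.add(startSet);
        while (!work.isEmpty()) {
            Set<Integer> current = work.remove();
            Map<Character, Set<Integer>> merged = dfa.get(current);
            // Merge the outgoing transitions of every member state.
            for (int s : current) {
                for (Map.Entry<Character, Set<Integer>> e
                        : nfa.getOrDefault(s, Map.of()).entrySet()) {
                    merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>())
                          .addAll(e.getValue());
                }
            }
            // Each distinct target set becomes one new deterministic state.
            for (Set<Integer> target : merged.values()) {
                if (!dfa.containsKey(target)) {
                    dfa.put(target, new TreeMap<>());
                    work.add(target);
                }
            }
        }
        return dfa;
    }

    // Example machine: state 0 on 'a' may go to state 1 or state 2.
    static Map<Integer, Map<Character, Set<Integer>>> exampleNfa() {
        Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();
        nfa.put(0, Map.of('a', Set.of(1, 2)));
        nfa.put(1, Map.of('b', Set.of(3)));
        nfa.put(2, Map.of('c', Set.of(3)));
        return nfa;
    }

    public static void main(String[] args) {
        determinize(exampleNfa(), 0).forEach((state, out) ->
                System.out.println(state + " -> " + out));
    }
}
```

In the example, the combined state {1, 2} carries one output transition for each transition of the states it replaces ('b' and 'c'), so every input leads to exactly one next state.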
In certain implementations, the data structure is a table. In such case, initializing the data structure would involve defining columns in the table. Processing the input characters to determine at least one transition to at least one state comprises indicating one row as a decision point. An input character is received and a new row is added to the table for the input character. An input column corresponding to the input character in at least one decision point row is set to point to a row number of the added new row.
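The row-adding step described above can be sketched for the simplest case, a plain sequence of input characters with no alternation or repetition. The class `StateTableBuilder`, its method names, the use of row 1 as the start state, and the choice to reuse an existing row when a cell is already filled are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of building a state table one input character at a time.
class StateTableBuilder {
    private final String columns;                 // one column per input character
    private final List<int[]> rows = new ArrayList<>();

    // Initializing the table defines its columns.
    StateTableBuilder(String columns) {
        this.columns = columns;
        rows.add(new int[columns.length()]);      // row 0 unused so 0 means "empty"
        rows.add(new int[columns.length()]);      // row 1: start state
    }

    // Adds one plain character sequence, starting at the start row.
    void addSequence(String sequence) {
        int decisionPoint = 1;
        for (char c : sequence.toCharArray()) {
            int col = columns.indexOf(c);
            if (col < 0) {
                throw new IllegalArgumentException("no column for '" + c + "'");
            }
            int[] row = rows.get(decisionPoint);
            if (row[col] == 0) {
                // Add a new row and point the decision point row at it.
                rows.add(new int[columns.length()]);
                row[col] = rows.size() - 1;
            }
            // The row just reached becomes the next decision point.
            decisionPoint = row[col];
        }
    }

    int[][] table() {
        return rows.toArray(new int[0][]);
    }

    public static void main(String[] args) {
        StateTableBuilder b = new StateTableBuilder("ab");
        b.addSequence("ab");
        b.addSequence("aa");
        int[][] t = b.table();
        for (int i = 1; i < t.length; i++) {
            System.out.println(i + ": " + java.util.Arrays.toString(t[i]));
        }
    }
}
```

Reusing an already filled cell keeps the table deterministic when two sequences share a prefix; a fuller treatment would track several decision point rows at once for alternation and repetition.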
Preferred embodiments provide an algorithm for processing a set of regular expressions to generate a deterministic state table therefrom. With preferred embodiments, a word processing application developer need only define a set of regular expressions defining sequences of characters that form a known entity, such as a word, sentence or paragraph. In this way, if a software developer updates, modifies or completely replaces the set of regular expressions, the program may automatically generate a new deterministic state machine table from these regular expressions. Preferred embodiments allow developers to modify the set of regular expressions without having to spend time encoding a state table representing the regular expressions.