The invention relates to a method of segmenting a connected text into words, including the steps of reading an input string representing the connected text; identifying at least one sequence of isolated words in the input string by comparing the input string to words in a dictionary; and outputting at least one of the identified word sequences.
The invention farther relates to a system for segmenting a connected text into words; the system including means for reading an input string representing the connected text; means for identifying at least one sequence of isolated words in the input string by comparing the input string to words in a dictionary; and means for outputting at least one of the identified word sequences.
Increasingly advanced natural language processing techniques are used in data processing systems, such as speech processing systems, handwriting/optical character recognition systems, automatic translation systems, or for spell/grammar checking in word processing systems. Such systems frequently use statistical information relating to individual words or word sequences. The statistical information is obtained by analyzing large text corpora. For the analysis, individual words need to be identified in the text. In many languages, including the western languages, words are separated by boundary markers, such as a space or other punctuation marks, making identification easy. However, many other languages do not have boundary markers between words. Examples of such languages are many Asian languages such as Chinese, Japanese, and Hangul. Such languages are sometimes referred as agglutinative languages. Typically, these languages are written using special characters (xe2x80x9cideographsxe2x80x9d), which each represent one or more syllables and usually a concept or meaningful unit. A word consists of one or more of these characters. A reader of a text in such a language must identify the boundaries of those words to make sense of the text. For many applications only one sequence of words must be identified.
From U.S. Pat. No. 5,448,474 a method and system for isolation of Chinese words from connected Chinese text is known. In this system a dictionary lookup process is performed, where all substrings of a text are identified. For each character of the text, it is checked for each word of the dictionary whether the word matches the text starting at that position. As an example, for the text xe2x80x9csoftwarexe2x80x9d, at position 0 (first character of the text) a match is found for the words xe2x80x9csoxe2x80x9d, xe2x80x9csoftxe2x80x9d, and xe2x80x9csoftwarexe2x80x9d; at position 1 for the words xe2x80x9cofxe2x80x9d and xe2x80x9coftxe2x80x9d; at position 4 for xe2x80x9cwarxe2x80x9d and xe2x80x9cwarexe2x80x9d; at position 5 for xe2x80x9caxe2x80x9d and xe2x80x9carexe2x80x9d; and at position 6 for xe2x80x9crexe2x80x9d. For each match an entry is created in a table. The entry comprises the matching word, the position of the text at which the match starts, and the length of the word. If at a position no matching word is found, an entry is made in the table containing the individual character. In this way all possible words and unmatched characters are added to the table. Next, the number of entries in the table is reduced, based on criteria such as that a word must start adjacent to the end of a preceding word and must end adjacent to the start of a next word. Since in this way overlapping words (which are not adjacent) will be removed, parts of the text may not be covered by identified words. A separate restoration process is performed to correct undesired deletion of overlapping words, based on a criterion to keep the longest matching overlapping word. Finally, again all words are removed which are not adjacent the end or beginning of the text or another non-deleted word. The final outcome may contain several possible word sequences. Information regarding the frequency of occurrence of words in general texts may be used to select one of the sequences. For instance, a sequence with a two-character Chinese word may be selected above a same sequence with the two characters represented by two single-character words, since two character words are more common than single character words.
The known isolation procedure is complex and requires a restoration process to correct wrong deletions.
It is an object of the invention to provide a method and system of the kind set forth, which is more efficient.
To meet the object of the invention, the method is characterized in that the step of identifying at least one word sequence includes building a tree structure representing word sequence(s) in the input string in an iterative manner by:
taking the input string as a working string;
for each word of a dictionary:
comparing the word with a beginning of the working string; and
if the word matches the beginning of the working string:
forming a node in the tree representing the word;
associating with the node a part of the input string which starts at a position immediately adjacent an end position of the word; and
forming a sub-tree, linked to the node, representing word sequence(s) in the part of the input string associated with the node by using the associated part as the working string.
By building a tree structure, analyzing the input string automatically results in identifying only words which are adjacent to a preceding word. All identified word sequences of which the last word ends at the end of the input string are in principle possible. In this way words which are not possible (in view of preceding words) are not considered as candidates. This reduces the amount of data to be processed. Moreover, complex procedures of deleting words and reintroducing overlaps are not required. Segmenting the sample string xe2x80x9csoftwarexe2x80x9d according to the invention results in a logical tree structure with two main branches, one branch having a single node representing the word xe2x80x9csoftwarexe2x80x9d and one branch having two linked nodes representing the words xe2x80x9csoftxe2x80x9d and xe2x80x9cwarexe2x80x9d respectively. Consequently only three entries are required instead of 10 in the prior art system.
In an embodiment according to the invention as described in the dependent claim 2, a plurality of new words are added with different lengths if a predetermined criterion is met. By adding an unknown sequence of characters not just as single character words to a data structure, it becomes possible to identify multi-character new words in a simple way. This makes the procedure suitable for languages, such as Japanese, wherein many single characters do not represent a word. Furthermore, it allows identifying a multi-character word as the preferred new word in which case the single character words do not need to be entered to the dictionary. In this way it is avoided that the dictionary gets xe2x80x98pollutedxe2x80x99 with single character words. Having many single character entries in a dictionary reduces the chance of a correct segmentation of the text into words. As an example, the text xe2x80x9cthisbookxe2x80x9d might get segmented into the sequence of words xe2x80x9ctxe2x80x9d, xe2x80x9chisxe2x80x9d, and xe2x80x9cbookxe2x80x9d if the single character xe2x80x9ctxe2x80x9d is in the dictionary.
In an embodiment according to the invention as described in the dependent claim 3 such criterion is a global decision based on whether or not a word sequence could be identified using the existing dictionary. If no sequence could be identified, new words are added. This test may be performed by first building a tree structure using only known words of the existing dictionary, and after the tree has been build verifying whether at least one path represents a word sequence which matches the entire input string. The verification can be very simple by during the building of the tree structure setting a parameter (end-of-string-reached) as soon as a first path through the tree has reached the end of the input string.
In an embodiment as defined in the dependent claim 4, the new words are added to one or more end nodes of paths whose corresponding word sequences do not match the entire input string. Such nodes may simply be located by following paths through the tree and verifying if the end node of the path corresponds to the end of the input string (i.e. the words match as well as the location in the string. This can be checked in a simple manner by verifying whether the part of the input string associated with the end node is empty, indicating that matching words have been found along the entire string). In a preferred embodiment, a global decision is taken whether or not to add new words (as described above). If new words are to be added, the tree structure is re-built. During the re-building of the tree, nodes to add new words to are found at those places in the tree where no word of the dictionary matches the remaining part of the input string (and not the entire string has been processed yet).
In an embodiment as defined in the dependent claim 5, it is calculated how many words match the beginning of the working string. If this number is below a threshold, then new words are added. How many new words are added may depend on the number of matching words found, where preferably more new words are added if few matches were found. In this way, a desired number of alternatives in word sequences can be created. In an embodiment as defined in the dependent claim 6, as an extreme the threshold may be one, resulting in new words being added if not a single word of the existing dictionary matched the beginning of the working string. The embodiment of claims 5 and 6 are preferably used for taking local decisions in the tree, in the sense that for each branch in the tree it is decided whether or not to add (more) new words.
In an alternative embodiment for taking local decisions as defined in the dependent claim 7, the number of new words already in a path are taken as a measure for determining whether new words need to be added. In an xe2x80x98idealxe2x80x99 scenario, if for the first time in a path a new word is required only one or a few new words are added, which indeed all are possible (from the perspective of a human reader) In reality many candidates each having a different length may need to be tested. Some wrong candidate words can cause a misalignment in identifying words in the remainder of the string. Without special measures such a misalignment would result in adding further new words (possibly followed by further new words, etc.). By allowing, for instance, two or three new words in a path it is avoided that the tree expands rapidly with many sequences which are caused by wrong new words.
In an alternative embodiment for taking local decisions as defined in the dependent claim 8, the likelihood of the word sequences (and corresponding paths through the tree) are calculated. If the likelihood drops too low, the path is no longer extended. In this way, unrealistic segmentations, which may include new words, are not considered further. Advantageously, the threshold is dynamically established to ensure a relative ranking. If already one or more sequences have been identified (with a calculated likelihood) other sequences are only processed as long as the sequence has a higher likelihood. Preferably, a new word is given a relatively low likelihood, where the likelihood may depend on the length of the new word. In this way the likelihood of a word sequence decreases with the number of new words in the sequence, as defined in the dependent claim 9. In this way it is avoided that a wrong choice of a new word (which results in the remainder of the string also being misaligned and requiring further new words) results in a continuously expanding tree with many new words.
According to an embodiment as defined in the dependent claim 10, the length of new words is restricted to K characters, K greater than 1. Preferably K equals five, ensuring that, particularly for Asian languages with mainly short words, most words can be identified without building an excessively large tree.
According to an embodiment as defined in the dependent claim 11, a path in the tree is only considered to represent a valid segmentation if the last word of the path ends aligned with the ending of the input string. This allows identifying valid sequences by backtracing starting only from those end nodes (leaves) in the tree whose associated word is aligned with the ending of the input string.
According to an embodiment as defined in the dependent claim 12, a statistical N-gram language model is used for determining a most likely path through the tree. In this way a founded decision is taken to select a most likely sequence from several possible sequences. The words of this sequence are output as representing the segmented text. Particularly if the method is used for building a lexicon (vocabulary and/or language model) for speech recognition systems, it is preferred that the already present default lexicon with its N-gram language model is used. Preferably a 2-gram or 3-gram is used if the vocabulary is large (e.g. over 10,000 entries).
To meet the object of the invention, the system is characterized in that in that the means for identifying at least one word sequence is operative to build tree structure representing word sequencers) in the input string in an iterative manner by:
taking the input string as a working string;
for each word of a dictionary:
comparing the word with a beginning of the working string; and
if the word matches the beginning of the working string:
forming a node in the tree representing the word;
associating with the node a part of the input string which starts at a position immediately adjacent an end position of the word; and
forming a sub-tree, linked to the node, representing word sequence(s) in the part of the input string associated with the node by using the associated part as the working string.