1. Field of the Invention
Preferred embodiments provide a method, system, and program for determining boundaries in a string using a dictionary and, in particular, determining word boundaries.
2. Description of the Related Art
Most computer text editors, such as word processing programs, display words on a page such that characters within each word remain together. Thus, if an end of a line is reached, any word that would extend beyond the end of the line will be displayed or positioned as the first word in the next line. This same principle for positioning words on a line applies to printing text. A legal break position comes between a non-whitespace character and a whitespace character (but not the other way around; this leads to a "word" being a series of non-whitespace characters followed by a string of whitespace characters). Languages that do not use spaces may use punctuation marks to indicate a break point rather than the whitespace. In certain instances, some languages will not break on whitespaces (e.g., in French a space is placed between the last word in a sentence and a following question mark. In spite of this space, the break is still placed following the question mark to keep the word and question mark together).
For instance, Thai does not always separate words with spaces. However, when wrapping words of text on a display screen or printed paper, it is undesirable to split a word across two lines. One solution to ensure that line breaks in a string of unseparated words occur between words is to have the user of the text editor insert an invisible space between the words. Thus, when a Thai writer notices that certain compound words are broken in the middle of a word when wrapping to the next line, the Thai writer would manually insert an invisible space between the words to allow the lines to break in the proper places. This method can be tedious as it requires reliance on human observation and manual intervention to specify the places in the text where it is legal to break lines.
Another technique for determining legal breaks in text is dictionary-based boundary detection. Current dictionary-based boundary detection techniques include in the dictionary common words that writers combine together without any break spaces, such as whitespaces. Current dictionary systems do not examine the document thoroughly for words that occur within the dictionary. When an instance of an unseparated word is found in the dictionary, a dictionary program or spell checker may propose a break to correct the problem. However, such methods are limited because the only unseparated words that can be detected are those encoded in the dictionary. Typically, current dictionary-based boundary detection provides only a limited set of unseparated words to detect.
For the above reasons, there is a need in the art for an improved method, system, and program for determining boundaries within a string of words that does not have any word boundary indicators.
To overcome the limitations in the prior art described above, preferred embodiments disclose a method, system, and program for determining boundaries in a string of characters using a dictionary. A determination is made of all possible initial substrings of the string in the dictionary. One initial substring is selected such that all the characters following the initial substring can be divided into at least one substring that appears in the dictionary. The boundaries follow the initial substring and each of the at least one substring that includes the characters following the initial substring.
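The two determinations described above can be sketched in Python. This is only an illustrative sketch, not the claimed implementation: the function names, the use of a Python set as the dictionary, and the sample entries are assumptions introduced here. It also assumes that an empty remainder counts as fully divided.

```python
def initial_substrings(text, dictionary):
    """Return every prefix of `text` that is a dictionary entry, shortest first."""
    return [text[:end] for end in range(1, len(text) + 1) if text[:end] in dictionary]


def can_divide(text, dictionary):
    """Return True if `text` can be split entirely into dictionary entries."""
    if not text:
        # Assumption: nothing left over means the division succeeded.
        return True
    return any(
        can_divide(text[len(prefix):], dictionary)
        for prefix in initial_substrings(text, dictionary)
    )


def viable_initial_substrings(text, dictionary):
    """Return the initial substrings whose remainder can also be divided."""
    return [
        prefix
        for prefix in initial_substrings(text, dictionary)
        if can_divide(text[len(prefix):], dictionary)
    ]


# Hypothetical dictionary and input, for illustration only.
sample_dictionary = {"the", "them", "me", "men"}
candidates = initial_substrings("themen", sample_dictionary)
viable = viable_initial_substrings("themen", sample_dictionary)
```

With this sample data, both "the" and "them" are initial substrings found in the dictionary, but only "the" is viable: the characters after "them" ("en") cannot be divided into dictionary entries, whereas the characters after "the" form the entry "men".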
In further embodiments, the longest possible initial substring is selected.
In still further embodiments, selecting the initial substring comprises selecting a longest possible initial substring that was not previously selected until one initial substring is selected such that the characters following the selected initial substring can be divided into at least one substring in the dictionary.
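The longest-first selection with fallback described above might look like the following Python sketch. It is a minimal illustration under stated assumptions: the name `find_boundaries`, the representation of the dictionary as a set, the representation of boundaries as character offsets, and the sample entries are all hypothetical and not taken from the embodiments.

```python
def find_boundaries(text, dictionary, start=0):
    """Return the end offsets of the selected substrings, or None if the
    characters from `start` onward cannot be divided into dictionary entries."""
    if start == len(text):
        return []
    # Try candidate initial substrings from longest to shortest.
    for end in range(len(text), start, -1):
        if text[start:end] in dictionary:
            rest = find_boundaries(text, dictionary, end)
            if rest is not None:
                return [end] + rest
            # The remainder could not be divided, so fall back to the
            # next-longest candidate that has not yet been tried.
    return None


# Hypothetical dictionary and input, for illustration only.
sample_dictionary = {"the", "them", "me", "men"}
boundaries = find_boundaries("themen", sample_dictionary)
```

Here the longest initial substring "them" is tried first and rejected because "en" cannot be divided, so the next-longest candidate "the" is selected, giving boundaries after offsets 3 and 6. The sketch recomputes sub-results and can be slow on long strings; caching results per start offset would be one way to avoid that.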
In certain embodiments, the substrings comprise words and the boundaries comprise word boundaries.
Preferred embodiments provide an algorithm for determining word boundaries in a string of unseparated multiple words. Preferred embodiments use an algorithm that will consider different possible word combinations until all the characters of the string fall within word boundaries, if such an arrangement is possible.