The present invention is related to word-breakers. More particularly, the present invention is related to new word collection methods for use in word-breaking.
Word-breaking is an important component of natural language processing applications that process textual inputs. In particular, word-breaking is important in most search engines. The search engines perform word-breaking on input strings for several purposes. For example, word-breaking is applied to input strings to determine component words of a compound word.
Word-breaking is especially important in languages such as Japanese, Chinese and Korean. Japanese and Korean are agglutinative languages. An agglutinative language is a language in which words are made up of a linear sequence of distinct morphemes, and each component of meaning is represented by its own morpheme. Other examples of agglutinative languages include Sumerian, Hurrian, Urartian, Basque and Turkish. Generally, in agglutinative languages, and likewise in Chinese, words can be compounded without spaces separating the component words.
Search targets frequently contain new words which are not yet in dictionaries, and which are not represented in a custom lexicon. When unknown words are included in the input string of a search engine query, or in a document to be indexed and searched, it is difficult for the word-breaker to properly word-break the string. This is particularly true in languages in which the words are not separated by spaces. Improper word-breaking of this kind can lower the precision and coverage of the search results.
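The effect of an unknown word on a lexicon-driven word-breaker can be illustrated with a minimal sketch. The sketch assumes a greedy longest-match strategy, which is only one of several possible word-breaking techniques and is not asserted to be the technique used by any particular search engine. The function name `break_words`, the sample lexicon, and the use of unspaced English strings as a stand-in for unspaced text are all hypothetical and chosen for illustration only.

```python
def break_words(text, lexicon):
    """Greedy longest-match word-breaking against a lexicon (illustrative only)."""
    # Longest lexicon entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            # No lexicon entry starts here: emit a single character.
            j = i + 1
        tokens.append(text[i:j])
        i = j
    return tokens


# Hypothetical lexicon; "blog" plays the role of a new, unknown word.
lexicon = {"search", "engine", "word", "breaker"}

known = break_words("searchengine", lexicon)
unknown = break_words("blogsearch", lexicon)
repaired = break_words("blogsearch", lexicon | {"blog"})

print(known)     # ['search', 'engine']
print(unknown)   # ['b', 'l', 'o', 'g', 'search']
print(repaired)  # ['blog', 'search']
```

In this sketch, the compound of known words is broken into its component words, while the string containing the unknown word degrades into single-character fragments until the new word is added to the lexicon.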
Collecting new words for a custom lexicon used by the word-breaker is an endless task, because new words continue to appear. Existing techniques for collecting the new words for the custom lexicon are time-consuming and burdensome. Typically, new words are manually collected by search site owners for addition to the custom lexicon used by that search site. New words are also manually collected by developers for inclusion in the system dictionary of the next product generation. Because these new word collection techniques rely on manual effort, they are slow and labor-intensive.