Words are the basis of many language-processing technologies. For example, vocabularies with different properties underlie natural language understanding, machine translation, automatic summarization, and similar applications. In information retrieval, words are used as search units to reduce redundancy in search results. In speech recognition, words serve as the lowest level of linguistic information for resolving character-level acoustic ambiguities, and language models are often built at the word level for the same purpose.
For some languages, however, such as Chinese and Japanese, the written language has no word boundaries, and words are not well defined. For example, some people may regard “” as one word, while others may regard it as two words, “” and “”. Generally, a Chinese word is composed of one or more Chinese characters and is a basic unit that carries a certain meaning. Various vocabularies have been collected manually, with different coverage for different domains, but collecting such vocabularies is not an easy task. Furthermore, languages keep developing, and new words emerge dynamically. For example, “” was not a word some time ago, but it is now widely used. There is therefore strong demand for automatically extracting new words from a large corpus.
A need therefore exists for a method and system for automatically extracting new words from a corpus.