This disclosure relates to dictionaries for natural language processing applications, such as machine translation, non-Roman language word segmentation, speech recognition and input method editors.
Increasingly advanced natural language processing techniques are used in data processing systems, such as speech processing systems, handwriting/optical character recognition systems, automatic translation systems, and word processing systems that perform spelling/grammar checking. These natural language processing techniques can include automatic updating of dictionaries for natural language applications related to, e.g., non-Roman language word segmentation, machine translation, automatic proofreading, speech recognition, input method editors, etc.
Non-Roman languages that use a logographic script, in which one or two characters, e.g., glyphs, correspond to one word or meaning, have more characters than there are keys on a standard input device, such as a computer keyboard or a mobile device keypad. For example, the Chinese language contains tens of thousands of ideographic characters defined by base phonetic or Pinyin characters and five tones. The mapping of these many-to-one associations can be implemented by input methods that facilitate entry of characters and symbols not found on input devices. Accordingly, a Western-style keyboard can be used to input Chinese, Japanese, or Korean characters.
An input method editor can be used to realize an input method. Such input method editors can include or access dictionaries of words and/or phrases. Lexicons of languages are constantly evolving, however, and thus the dictionaries for the input method editors can require frequent updates. For example, a new word, e.g., a pop-culture reference or a new trade name for a product, may be rapidly introduced into a lexicon. Failure to update an input method editor dictionary in a timely manner can thus degrade the user experience, as the user may be unable to use, or may have difficulty using, the input method editor to input the new word into an input field. For example, a user may desire to submit a new word, e.g., a new trade name, as a search query to a search engine. If the input method editor does not recognize the new word, however, the user may experience difficulty in inputting the new word into the search engine.
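The dependence of an input method editor on its dictionary can be illustrated with a minimal sketch. The class name ToyImeDictionary, the Pinyin readings, and the example words are hypothetical and chosen only for illustration; they are not taken from this disclosure, and an actual input method editor would rank candidates and handle tones, which this sketch does not.

```python
from collections import defaultdict


class ToyImeDictionary:
    """Hypothetical toy dictionary: one typed reading maps to many candidate words."""

    def __init__(self):
        self._candidates = defaultdict(list)

    def add(self, reading, word):
        # Register a word under its typed reading, skipping duplicates.
        if word not in self._candidates[reading]:
            self._candidates[reading].append(word)

    def lookup(self, reading):
        # Return a copy of the candidate list; empty if the reading is unknown.
        return list(self._candidates.get(reading, []))


ime = ToyImeDictionary()
for word in ("妈", "马", "吗"):
    ime.add("ma", word)       # several characters share one toneless reading
ime.add("nihao", "你好")

known = ime.lookup("ma")              # candidates offered for a known reading
missing_before = ime.lookup("weibo")  # assumed "new word": no candidates yet
ime.add("weibo", "微博")               # dictionary update
missing_after = ime.lookup("weibo")   # the new word is now offered
```

Until the dictionary is updated, the lookup for the assumed new word returns no candidates, which corresponds to the degraded user experience described above.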
In some languages, such as Chinese, Japanese, Thai and Korean, there are no word boundaries in sentences. Therefore, new words cannot be easily identified in text, as the new words are compounded sequences of characters or existing words. This makes new word detection a difficult task for those languages. Additionally, once new words are identified, it is desirable to identify topics to which the new words and other existing words are related. The identification of such topics can improve the performance of a language model and/or a system or device using the language model for languages without word boundaries in sentences, or for other languages.
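The difficulty of identifying a new word in text without word boundaries can be illustrated with a minimal sketch of forward maximum matching, a common dictionary-based segmentation heuristic that is used here only as an example and is not the technique of this disclosure. The function name forward_max_match, the max_len parameter, the small dictionary, and the sample sentence are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word.

    Falls back to a single character when no dictionary word matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


dictionary = {"我", "喜欢", "用"}   # hypothetical existing dictionary
sentence = "我喜欢用微博"            # contains an assumed new word, "微博"

before = forward_max_match(sentence, dictionary)
after = forward_max_match(sentence, dictionary | {"微博"})
```

With the original dictionary, the assumed new word is split into two single characters; once the word is added to the dictionary, it is segmented as one unit.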