Natural language processing encompasses computer understanding, analysis, manipulation, and generation of natural language. From simple natural language processing applications, such as string manipulation (e.g., stemming), to higher-level tasks such as machine translation and question answering, the ability to identify and extract entity names and jargon terms in a text corpus is very important. Identifying the proper names in a text is important to understanding and using that text. For example, in a Chinese-English machine translation system, if a person name is identified, it can be converted to pinyin (a system for transliterating Chinese characters into the Latin alphabet) rather than being directly translated.
Entity names include, for example, the names of people, places, and organizations, as well as dates, times, monetary amounts, and percentages. Name entity and jargon term extraction involves identifying named entities in the context of a text corpus. For example, a name entity extraction system must differentiate between “white house” as an adjective-noun combination and “White House” as a named organization or a named location. In English, the use of uppercase and lowercase letters may be indicative, but cannot be relied on by itself to determine name entities and jargon terms. Moreover, case does not aid name entity and jargon term recognition and extraction in languages in which case does not indicate proper nouns (e.g., Chinese) or in non-text modalities (e.g., speech).
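The limits of relying on letter case can be illustrated with a minimal sketch. The function name `capitalized_candidates` and the sample sentences are hypothetical, chosen only for illustration; the heuristic simply collects runs of capitalized words and is not a method described in this document.

```python
import re

def capitalized_candidates(text):
    """Return runs of consecutive capitalized words as naive name candidates.

    Illustrative heuristic only: it looks at letter case and nothing else.
    """
    return re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)

samples = [
    "The White House issued a statement.",         # named entity, but "The" is absorbed
    "They painted the white house last summer.",   # no entity; "They" is a false positive
    "the white house issued a statement.",         # entity written in lowercase is missed
]

for sentence in samples:
    print(sentence, "->", capitalized_candidates(sentence))
```

In this sketch the sentence-initial word is wrongly picked up in the first two sentences, and the lowercase form of the name is missed entirely in the third, which is why case alone is at best a weak cue.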
Three general methods are typically employed for name entity and jargon term recognition and extraction. The first is to construct rules and keyword sets manually, using hand-crafted modules that encode linguistic knowledge specific to the language and document genre. This method is easy to implement, but it is time-consuming and prone to errors; moreover, it is not easily portable to new languages. A second technique uses a statistical model (e.g., a Hidden Markov Model) that requires a great deal of annotated training data. A third method is memory-based learning, which treats the problem of entity extraction as a series of classification processes. Each of these methods is language dependent and relies on past experience. These are serious drawbacks in dealing with unrecognized entity names and jargon terms.
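As a rough illustration of the first method, the following sketch applies hand-built keyword sets to a tokenized sentence. The keyword sets `TITLES` and `ORG_SUFFIXES`, the function `tag_entities`, and the example sentences are all assumptions made for this example; a real rule-based system would carry far larger rule sets tuned to one language and genre.

```python
# Hypothetical hand-crafted keyword sets (English only, for illustration).
TITLES = {"Mr.", "Ms.", "Dr."}
ORG_SUFFIXES = {"Inc.", "Corp.", "Ltd."}

def tag_entities(tokens):
    """Apply two hand-written rules to a list of tokens.

    Rule 1: a token following a title keyword is tagged as a person name.
    Rule 2: a token preceding an organization suffix is tagged, together
            with the suffix, as an organization name.
    """
    entities = []
    for i, tok in enumerate(tokens):
        if tok in TITLES and i + 1 < len(tokens):
            entities.append(("PERSON", tokens[i + 1]))
        elif tok in ORG_SUFFIXES and i > 0:
            entities.append(("ORGANIZATION", tokens[i - 1] + " " + tok))
    return entities

# Names introduced by a known keyword are found.
print(tag_entities("Dr. Chen joined Acme Corp. in May".split()))

# The same names without a trigger keyword are not recognized at all.
print(tag_entities("Chen joined Acme in May".split()))
```

The second call returns an empty list: names that appear without one of the anticipated keywords are not recognized, and none of the English keywords would carry over to another language.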