Natural language processing (NLP) systems are computer-implemented software systems that intelligently derive meaning and context from natural language text. “Natural languages” are languages that are spoken by humans (e.g., English, French and Japanese). Computers cannot, without assistance, distinguish linguistic characteristics of natural language text. Natural language processing systems are employed in a wide range of products, including Information Extraction (IE) engines, spelling checkers, grammar checkers, machine translation systems, and speech synthesis programs.
Often, natural languages contain ambiguities that are difficult to resolve using computer automated techniques. Word disambiguation may be necessary because many words in any natural language have more than one sense. For example, the English noun “sentence” has at least two senses in common usage: one relating to grammar, where a sentence is a part of a text or speech, and one relating to punishment, where a sentence is a punishment imposed for a crime. Human beings use the context in which the word appears and their general knowledge of the world to determine which sense is meant.
Named entity recognition (NER) focuses on the proper detection and classification of proper noun sequences into semantic categories, such as person name, organization name and/or location name. NER may be the first major step in the more comprehensive task of IE.
For example, a cross-lingual retrieval of Chinese scientific documents using English as the query language may necessitate the use of NER. Conventional keyword retrieval systems may not be sufficient to handle queries of this type. Users may be interested in queries such as finding people associated with alternative fuel for aerospace applications and retrieving a list of people and/or relevant publications that match this query. Keyword querying, while very efficient for document retrieval, may not be sufficient to respond to such queries. It may be necessary to index the documents, identify key topics, and identify named entities. Thus, a response to such a query may first filter documents based on topic match, and subsequently return people names as results. Metadata associated with scientific documents may help in certain types of queries, but often the names of interest may be in the body of the document. Thus, NER may be required. Such queries may not require machine translation, although the results may require transliteration/translation back to English for readability.
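The two-stage response described above (filter by topic, then return people names) can be illustrated with a minimal sketch. It assumes the documents have already been indexed with topics and tagged with named entities; the `Document` class, the category label `PERSON`, and the sample records are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Document:
    """A document that has already been indexed and entity-tagged (assumed)."""
    doc_id: str
    topics: Set[str]
    entities: List[Tuple[str, str]]  # (surface text, semantic category)


def people_for_topic(docs: List[Document], topic: str) -> List[str]:
    """Return person names found in documents that match the topic."""
    names: List[str] = []
    for doc in docs:
        # Stage 1: keep only documents whose topics match the query topic.
        if topic not in doc.topics:
            continue
        # Stage 2: return the person-name entities from those documents.
        for text, category in doc.entities:
            if category == "PERSON" and text not in names:
                names.append(text)
    return names


# Hypothetical sample data, for illustration only.
docs = [
    Document("d1", {"alternative fuel", "aerospace"},
             [("Li Wei", "PERSON"), ("Example Institute", "ORGANIZATION")]),
    Document("d2", {"materials"}, [("Zhang Min", "PERSON")]),
]

print(people_for_topic(docs, "alternative fuel"))
```

The names are taken from the tagged body of each document rather than from metadata, which is the case in which NER may be required.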
In another example, where Chinese documents may be translated by a machine translation system into English to facilitate searching and browsing, the user may often obtain poor search results due to name translation errors. Although native Chinese names may be translated fairly well, names of non-Chinese origin may tend to be translated poorly. In addition to original English names, the latter category may include Japanese, Korean, and Vietnamese names along with non-Han Chinese names, such as Tibetan and Mongolian. Translation of non-Chinese names may include both a transliteration component and a translation component; the latter may be seen when a name includes a common noun, such as “Mount Everest”.
Using native Chinese tagging and categorizing of named entities prior to machine translation may improve the quality of subsequent name translation, and even overall translation results. Current machine translation systems may be evaluated by a methodology that computes a score based on the similarity of automatic translations to a gold standard. Unfortunately, such metrics may not give much weight to name translation, as it may be possible to do well on these evaluations with relatively poor name translation. On the other hand, when using machine translated text in retrieval, the incorrect translation of names may cause poor search results.
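To illustrate why a similarity-based score may give little weight to names, the following is a minimal sketch of one very simple similarity measure, clipped unigram precision against a single reference. It is not the metric used by any particular evaluation; the function name and the misspelled name in the candidate are assumptions made for illustration.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference.

    Matches are clipped so a token cannot be credited more often than it
    occurs in the reference.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(count, ref_counts[token])
                  for token, count in Counter(cand_tokens).items())
    return matched / len(cand_tokens)


reference = "Tariq Aziz begins a four-day visit to Italy"
# Hypothetical machine output in which only the name is mistranslated.
candidate = "Tarik Asis begins a four-day visit to Italy"

print(unigram_precision(candidate, reference))  # 6 of 8 tokens match: 0.75
```

In this sketch the name is entirely wrong, yet three quarters of the tokens still match, while a retrieval query for the correctly spelled name would not find the candidate text at all.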
Various query templates have been developed that call for exact snippets of text to be returned in response to a specific query. Performing this task may require sophisticated IE techniques, which may extract and organize information such that specific responses (rather than merely relevant documents) may be generated in response to the query.
For example, one or more snippets may be returned as the result of a query such as the one below.
Query: WHERE HAS [Tariq Aziz] BEEN AND WHEN?
Snippet: Iraq's Deputy Prime Minister Tariq Aziz begins a four-day visit to Italy and the Vatican with Pope John Paul II.
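As a rough illustration of the query template above, the sketch below pulls the bracketed entity out of the query and returns the sentences that mention it. This is only a string-matching stand-in under stated assumptions: the function names are hypothetical, and an actual IE system would also need to recognize the entity in its various forms and extract the places and dates requested.

```python
import re
from typing import List, Optional


def extract_entity(query: str) -> Optional[str]:
    """Return the text inside the first [...] slot of a template query."""
    match = re.search(r"\[([^\]]+)\]", query)
    return match.group(1) if match else None


def matching_snippets(query: str, sentences: List[str]) -> List[str]:
    """Return sentences that mention the entity named in the query."""
    entity = extract_entity(query)
    if entity is None:
        return []
    return [s for s in sentences if entity.lower() in s.lower()]


query = "WHERE HAS [Tariq Aziz] BEEN AND WHEN?"
sentences = [
    "Iraq's Deputy Prime Minister Tariq Aziz begins a four-day visit to "
    "Italy and the Vatican with Pope John Paul II.",
    "The weather in Rome was mild this week.",  # unrelated filler sentence
]

for snippet in matching_snippets(query, sentences):
    print(snippet)
```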
Others have shown that entity tagging can improve the quality of machine translation. Without considering context, entities may be translated as regular common nouns. For example, in Chinese, most of the characters used in person names are also used elsewhere in the language. To accurately translate a name, it must first be identified as a name, and the means of translation then depends on what kind of name it is, such as a Chinese, Japanese, Korean, and/or English name.
In translation systems, a string of characters in one language may be converted into a string of characters in another language. One challenge to such translation systems may be that a word in one language may have multiple possible translations in the other language depending on the sense of the word. For example, the English word “plant” can be translated either to the Chinese word “gongchang”, which corresponds to the sense of “factory”, or to “zhiwu”, which corresponds to the sense of “vegetation”.
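The dependence of translation on word sense can be sketched as a lookup keyed by both the word and its sense. The two translations of “plant” come from the example above; the cue-word lists and the overlap-counting rule for choosing a sense are assumptions made for illustration, not a description of any particular translation system.

```python
from typing import List, Optional

# Translations keyed by (word, sense), taken from the "plant" example.
TRANSLATIONS = {
    ("plant", "factory"): "gongchang",
    ("plant", "vegetation"): "zhiwu",
}

# Hypothetical context cues for each sense (illustrative only).
SENSE_CUES = {
    "factory": {"workers", "manufacturing", "production"},
    "vegetation": {"leaves", "soil", "water", "grow"},
}


def choose_sense(context_tokens: List[str]) -> Optional[str]:
    """Pick the sense whose cue words overlap most with the context."""
    best_sense, best_overlap = None, 0
    context = set(context_tokens)
    for sense, cues in SENSE_CUES.items():
        overlap = len(cues & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense  # None when no cue word appears in the context


def translate(word: str, sentence: str) -> Optional[str]:
    """Translate a word according to the sense suggested by its sentence."""
    sense = choose_sense(sentence.lower().split())
    return TRANSLATIONS.get((word, sense))


print(translate("plant", "The plant hired more workers for production"))
print(translate("plant", "The plant needs water and soil"))
```

When no cue word is present, the sketch returns `None` rather than guessing, which mirrors the point that the word alone does not determine its translation.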
Therefore, a need exists for systems and methods to identify and categorize named or nominal entities in source language documents prior to or in lieu of a machine translation system translating the resulting snippets into a desired target language.