With the rapid development of the Internet, computer users all over the world are becoming increasingly more exposed to writings that are penned in non-native languages. Many users are entirely unfamiliar with non-native languages. Even for a user who has some training in a non-native language, it is often difficult for that user to read and comprehend the non-native language.
Consider the plight of a Chinese user who accesses web pages or other electronic documents written in English. The Chinese user may have had some formal training in English during school, but such training is often insufficient to enable them to fully read and comprehend certain words, phrases, or sentences written in English. The Chinese-English situation is used as but one example to illustrate the point. This problem persists across other language boundaries.
Natural language processing refers to machine processing of a natural language input. The natural language input can take any one of a variety of forms, including a textual input, a speech input, etc. Natural language processing attempts to gain an understanding of the meaning of the natural language input.
A base noun phrase (baseNP) as referred to herein is a noun phrase that does not contain other noun phrases recursively. For example, consider the sentence “Measures of manufacturing activity fell more than the overall measures.” The elements within square brackets in the following marked up sentence are baseNPs:[Measures/NNS] of/IN [manufacturing/VBG activity/NN] fell/VBD more/RBR than/IN [the/DT overall/JJ measures/NNS]./.where the symbols NNS, IN, VBG, etc. are part-of-speech tags as defined in M. Markus, Marcin Kiewicx, B. Santorini, Building a large annotated corpus of English: The Penn Treebank, Computational linguistics 19 (2): 313-330, 1993.
Identifying baseNP in a natural language input is an important subtask for many natural language processing applications. Such applications include, for example, partial parsing, information retrieval, and machine translation. Identifying baseNP can be useful in other applications as well.
A number of different types of methods have been developed in the past in order to identify baseNP. Some methods involved applying a transform-based, error-driven algorithm to develop a set of transformation rules, and using those rules to locally update the bracket positions identifying baseNP. Other methods introduced a memory-based sequence learning method in which training examples are stored and a generalization is performed at run time by comparing the sequence provided in the new text to positive and negative evidence developed by the generalizations. Yet another approach is an error driven pruning approach that extracts baseNP rules from the training corpus and prunes a number of bad baseNP identifications by incremental training and then applies the pruned rules to identify baseNPs through maximum length matching (or dynamic programming algorithms).
Some of these prior approaches assigned scores to each of a number of possible baseNP structures. Still others dealt with identifying baseNP on a deterministic and local level. However, none of these approaches considered any lexical information in identifying baseNP. See, for example, Lance A. Ramshaw, Michael P. Markus (In press), Text Chunking Using Transformation-Based Learning: Natural Language Processing Using Very Large Corpora., Kluwer, The Second Workshop on Very Large Corpora. WVLC'95, pp. 82-94; Cardie and D. Pierce, Error-Driven Pruning of Treebank Grammars for BaseNP Identification, Proceedings of the 36th International Conference on Computational Linguistics, pp. 218-224, 1998 (COLING-ACL'98); and S. Argamon, I. Dagan and Y. Krymolowski, A Memory-Based Approach to Learning Shallow Language Patterns, Proceedings of the 17th International Conference on Computational Linguistics, pp. 67-73 (COLING-ACL'98).
In addition, it can be seen from the example sentence illustrated above that, prior to identifying baseNPs, part-of-speech (POS) tagging must be preformed. The prior techniques for identifying baseNP treated the POS tagging and baseNP identification as two separate procedures. The prior techniques identified a best estimate of the POS tag sequence corresponding to the natural language input. Only the best estimate was provided to the baseNP identification component. However, the best estimate of the POS tag sequence may not be the actual POS tag sequence which corresponds to the natural language input. This type of system leads to disadvantages. For example, using the result of the first step (POS tagging) as if it were certain and providing it to the second step (baseNP identification) leads to more errors in identifying baseNP.