Sentences in a typical newspaper story include idioms, ellipses, and ungrammatical constructs. Since authentic language defies textbook grammar, the basic parsing paradigm must be tuned to the nature of the text under analysis.
Hypothetically, parsing could be performed by one huge unification mechanism as described in the literature: S. Shieber, "An Introduction to Unification-based Approaches to Grammar", Center for the Study of Language and Information, Palo Alto, Calif., 1986 and M. Tomita, "Efficient Parsing for Natural Language", Kluwer Academic Publishers, Hingham, Mass., 1986. Such a mechanism would receive its tokens in the form of words, characters, or morphemes, negotiate all given constraints, and produce a full chart with all possible interpretations.
However, when tested on a real corpus (i.e., Wall Street Journal (WSJ) news stories), this mechanism collapses. For a typical well-behaved 33-word sentence, it produces hundreds of candidate interpretations.
To alleviate the problems associated with processing real text, a new strategy has emerged. A preprocessor capitalizing on statistical data has been described in the literature: K. Church, W. Gale, P. Hanks, and D. Hindle, "Parsing, Word Associations, and Predicate-Argument Relations", Proceedings of the International Workshop on Parsing Technologies, Carnegie Mellon University, 1989 and I. Dagan, A. Itai, and U. Schwall, "Two Languages are More Informative Than One", Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, Calif., 1991. Such a preprocessor is trained to exploit properties of the corpus itself: it highlights regularities, identifies thematic relations, and, in general, feeds digested text into the unification parser.
Consider the following WSJ paragraph (Aug. 19, 1987), as processed by a preprocessor:
Separately, Kaneb Services spokesman/nn said/vb holders/nn of its Class A preferred/jj stock/nn failed/vb to elect two directors to the company/nn board/nn when the annual/jj meeting/nn resumed/vb Tuesday because there are questions as to the validity of the proxies/nn submitted/vb for review by the group.
The company/nn adjourned/vb its annual/jj meeting/nn May 12 to allow/vb time/nn for negotiations and expressed/vb concern/nn about future/jj actions/nn by preferred/jj holders/nn.
The problem that the present invention is intended to solve is the classification of content-word pairs into one of the following three categories:
______________________________________
1. and expressed/VB concern/NN about
2. Services spokesman/NN said/VB holders
3. class A preferred/JJ stock/NN *comma*
______________________________________
The constructs "expressed concern" and "spokesman said" must be tagged verb-object and noun-verb, respectively. "Preferred stock", on the other hand, must be identified and tagged as a fixed adjective-noun construct.
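As a minimal sketch of the three target categories, the following Python fragment maps a pair of already-resolved tags to a relation label. It assumes the tags have been disambiguated, which is the hard part the tagger must solve; the names RELATION_BY_TAGS, split_tagged, and classify_pairs are hypothetical and are not taken from the text.

```python
# Hypothetical lookup from a pair of resolved tags to one of the three
# categories named in the text. Illustrative only.
RELATION_BY_TAGS = {
    ("VB", "NN"): "verb-object",
    ("NN", "VB"): "noun-verb",
    ("JJ", "NN"): "adjective-noun",
}


def split_tagged(token):
    """Split 'word/TAG' into (word, TAG); untagged tokens get TAG None."""
    word, sep, tag = token.rpartition("/")
    if not sep:
        return token, None
    return word, tag.upper()


def classify_pairs(tagged_text):
    """Return (word1, word2, relation) for each adjacent tagged pair."""
    tokens = [split_tagged(t) for t in tagged_text.split()]
    results = []
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        relation = RELATION_BY_TAGS.get((t1, t2))
        if relation is not None:
            results.append((w1, w2, relation))
    return results


for line in (
    "and expressed/VB concern/NN about",
    "Services spokesman/NN said/VB holders",
    "class A preferred/JJ stock/NN *comma*",
):
    print(classify_pairs(line))
```

The lookup is deliberately trivial: once the tags are known, the category follows directly, so the difficulty lies entirely in choosing between candidate tags such as JJ and VB for "preferred".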
The complex scope of the pre-processing task is best illustrated by the input to the preprocessor shown below.
______________________________________
Kaneb       NM
Services    NN VB
spokesman   NN
said        JJ VB
holders     NN
of          PP
its         DT
Class       JJ NN
A           DT JJ
preferred   JJ VB
stock       NN VB
failed      AD VB
to          PP
elect       VB
two         JJ NN
directors   NN
to          PP
the         DT
company     NN
board       NN VB
when        CC
annual      JJ
meeting     NN VB
resumed     JJ VB
tuesday     NM
questions   NN VB
validity    NN
proxies     NN
submitted   JJ VB
group       NN VB
______________________________________
This lexical analysis of the sentence is based on the Collins on-line dictionary plus morphology. Each word is associated with candidate parts of speech, and almost all words are ambiguous. The tagger's task is to resolve the ambiguity.
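The tagger's input can be pictured as a mapping from each word to its candidate tags. The sketch below copies a few entries from the lexical analysis above. The names CANDIDATES, ambiguous_words, and count_tag_assignments are assumptions made for illustration, and the product it computes counts tag assignments only, not parser interpretations.

```python
from math import prod

# A few entries copied from the lexical analysis above: word -> candidate tags.
CANDIDATES = {
    "spokesman": ["NN"],
    "said": ["JJ", "VB"],
    "holders": ["NN"],
    "preferred": ["JJ", "VB"],
    "stock": ["NN", "VB"],
    "failed": ["AD", "VB"],
    "elect": ["VB"],
}


def ambiguous_words(words, lexicon=CANDIDATES):
    """Return the words that have more than one candidate tag."""
    return [w for w in words if len(lexicon[w]) > 1]


def count_tag_assignments(words, lexicon=CANDIDATES):
    """Number of distinct tag sequences if every candidate combination is allowed."""
    return prod(len(lexicon[w]) for w in words)


phrase = ["spokesman", "said", "holders", "preferred", "stock", "failed"]
print(ambiguous_words(phrase))
print(count_tag_assignments(phrase))
```

Because the count is a product over the words, each additional two-way ambiguity doubles the number of tag sequences, which is why resolving tags before full parsing matters.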
A program can bring three types of clues to bear in resolving part-of-speech ambiguity. The first is local context. Consider the following two cases where local context dominates:
1. the preferred stock raised
2. he expressed concern about
The words "the" and "he" dictate that "preferred" and "expressed" are an adjective and a verb, respectively. This kind of inference, due to its local nature, is captured and propagated by the preprocessor.
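The local-context inference for "the preferred stock raised" and "he expressed concern about" can be sketched as a rule that looks only at the preceding word. This is a hedged illustration, not the preprocessor's actual rule set: the word sets, the function name resolve_by_left_neighbor, and the assumption that "expressed" carries the candidates JJ and VB (like "preferred" in the lexical analysis) are all hypothetical.

```python
# Trigger words taken from the two examples in the text. A real system
# would presumably use larger classes; that extension is an assumption.
DETERMINERS = {"the"}
SUBJECT_PRONOUNS = {"he"}


def resolve_by_left_neighbor(prev_word, candidates):
    """Pick a tag using only the preceding word; return None if no rule fires."""
    prev = prev_word.lower()
    if prev in DETERMINERS and "JJ" in candidates:
        return "JJ"  # e.g. "the preferred ..." -> adjective
    if prev in SUBJECT_PRONOUNS and "VB" in candidates:
        return "VB"  # e.g. "he expressed ..." -> verb
    return None  # left for other clues (global context, corpus analysis)


print(resolve_by_left_neighbor("the", ["JJ", "VB"]))
print(resolve_by_left_neighbor("he", ["JJ", "VB"]))
print(resolve_by_left_neighbor("and", ["JJ", "VB"]))
```

Returning None when no rule fires keeps the local rule from guessing in cases such as "and preferred ..." or "and expressed ...", where the preceding word gives no decisive evidence.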
The second clue is global context. Global-sentence constraints are shown by the following two examples:
1. and preferred stock sold yesterday was . . .
2. and expressed concern about . . . *period*
In case 1, a main verb is found (i.e., "was"), and "preferred" is taken as an adjective; in case 2, a main verb is not found, and therefore "expressed" itself is taken as the main verb. This kind of ambiguity requires full-fledged unification, and it is not handled by the preprocessor. Fortunately, only a small percentage of the cases (in newspaper stories) depend on global reading.
The third type of clue is corpus analysis and is described in R. Beckwith, "WordNet: A Lexical Database Organized on Psycholinguistic Principles" in Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, Lawrence Erlbaum Assoc., 1991.