The present invention relates to systems and methods for parsing languages in open-ended semantic domains.
Prior parsing systems are directed mainly toward computer languages, which tend to be structured to minimize ambiguity. Consequently, such parsing systems have very limited applicability to natural languages, such as English, that are replete with ambiguous, yet valid, constructions. These prior parsing systems for computer languages are described in Aho, Alfred V., et al., Principles of Compiler Design, Massachusetts, Addison-Wesley, 1979.
Other approaches for parsing natural languages are described in Sager, Naomi, Natural Language Information Processing, Massachusetts, Addison-Wesley, 1981 and King, Margaret, ed., Parsing Natural Language, New York, Academic Press, 1983. The systems described in these works typically employ syntax-driven approaches, similar to those used in parsing computer languages, that emphasize meaning extraction in a limited semantic domain, i.e., a limited set of syntactic structures and a limited vocabulary. Where a parsing system emphasizes a syntactic approach, it attempts to derive the "true" parse, i.e., to determine the true underlying grammatical structure of a sentence.
In contrast, the present system and method do not attempt to check the grammatical acceptability of the underlying structure of the sentence. The present approach is based on the appreciation that simpler, more superficial structures are adequate for describing most of English, most of the time. Such simpler structures can be handled in less time and with less memory than the prior art approaches require; the present approach is therefore more acceptable to users of small computers or word processors.
One prior art approach to natural language parsing involved a 15,000-word dictionary addressed by a grammar program module and was based on the notion of triples of tags. Tags identifying the grammatical roles words can play were assigned to each entry in the dictionary, and a list of acceptable triples of tags, such as "article,adjective,singular-noun" and "adjective,singular-noun,singular-verb," was created. Thousands of these triples were needed to describe text generated even by young children. A sentence was found acceptable by the grammar program module if it was possible to string together overlapping triples from the list to match the sequence of tags provided from the dictionary. For example, the sentence "the yellow cat jumps" would be accepted by the grammar module with the two triples cited above.
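The overlapping-triples test described above can be illustrated with a minimal sketch. It is not the prior art implementation: the names `DICTIONARY`, `ACCEPTABLE_TRIPLES`, `tags_for` and `is_acceptable` are hypothetical, the dictionary holds five words rather than 15,000, and each word is assumed to carry a single tag although the dictionary described above could assign several.

```python
from typing import List, Sequence, Set, Tuple

Triple = Tuple[str, str, str]

# The two triples named in the text; a real list would hold thousands.
ACCEPTABLE_TRIPLES: Set[Triple] = {
    ("article", "adjective", "singular-noun"),
    ("adjective", "singular-noun", "singular-verb"),
}

# Hypothetical miniature dictionary. Assumption: one tag per word.
DICTIONARY = {
    "the": "article",
    "yellow": "adjective",
    "cat": "singular-noun",
    "jumps": "singular-verb",
    "jump": "plural-verb",
}


def tags_for(sentence: str) -> List[str]:
    """Look up the tag of each word in the sentence."""
    return [DICTIONARY[word] for word in sentence.lower().split()]


def is_acceptable(tags: Sequence[str]) -> bool:
    """Accept the tag sequence if every window of three consecutive tags
    is a listed triple, i.e. overlapping triples cover the sequence."""
    if len(tags) < 3:
        return False  # assumption: shorter inputs are not handled here
    return all(
        tuple(tags[i:i + 3]) in ACCEPTABLE_TRIPLES
        for i in range(len(tags) - 2)
    )


print(is_acceptable(tags_for("the yellow cat jumps")))
print(is_acceptable(tags_for("the yellow cat jump")))
```

Because each test looks at only three adjacent tags, the check is purely local, and every additional construction to be accepted requires further entries in the list.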
This approach to grammar was found to be inadequate because the large number of triples needed to describe the wide variety of acceptable grammatical structures required too much memory, and because many sentences with glaring errors were inevitably accepted by the grammar module. For example, errors of agreement between words separated by more than a few intermediaries cannot be caught by such a model. The triples approach has also been used to impose grammatical constraints in speech recognition by the International Business Machines Corporation.
Among other prior text processing systems, U.S. Pat. No. 3,704,345 discloses a system for converting printed text into speech sounds. The system includes a syntax analyzer that consults a phoneme dictionary, choosing a grammatical category for each word in an input sequence and assigning a phrase category to each word. The syntax analyzer may realize a logic tree representing each state of a sentence with each branch in the tree being matched to a word in the input sequence. Also, the decision logic may be implemented in a computer program operating as a matrix in which rows represent predetermined states of a sentence and columns represent the word class to be incorporated into the sentence. This type of system is limited in the variety of sentences it can process successfully because of the rigidly defined logic tree embodied in its syntax analyzer.
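The matrix form of such decision logic can be pictured with the following sketch. It is illustrative only and is not taken from the cited patent: the states, the word classes, the table entries and the names `MATRIX`, `ACCEPTING` and `accepts` are all assumptions chosen to keep the example small.

```python
from typing import List, Optional, Sequence

# Columns of the matrix: hypothetical word classes.
WORD_CLASSES = ["article", "adjective", "noun", "verb"]

# Rows of the matrix: hypothetical sentence states.
STATES = ["start", "article_seen", "adjective_seen", "noun_seen", "verb_seen"]

_ = None  # marks a transition the table does not allow

# MATRIX[state][word_class] gives the next state, or None if disallowed.
MATRIX: List[List[Optional[int]]] = [
    # article  adjective  noun  verb
    [1,        2,         3,    _],  # start
    [_,        2,         3,    _],  # article_seen
    [_,        2,         3,    _],  # adjective_seen
    [_,        _,         _,    4],  # noun_seen
    [_,        _,         _,    _],  # verb_seen
]

ACCEPTING = {4}  # assumption: a sentence must end right after its verb


def accepts(word_classes: Sequence[str]) -> bool:
    """Walk the matrix one word class at a time, starting in 'start'."""
    state = 0
    for word_class in word_classes:
        if word_class not in WORD_CLASSES:
            return False
        next_state = MATRIX[state][WORD_CLASSES.index(word_class)]
        if next_state is None:
            return False
        state = next_state
    return state in ACCEPTING


print(accepts(["article", "adjective", "noun", "verb"]))
print(accepts(["noun", "verb", "noun"]))
```

In this toy table a sequence such as noun, verb, noun is rejected simply because no row provides for a word after the verb, which is the kind of rigidity noted above: only sentence shapes anticipated in the fixed table can be processed.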
Other references in the prior art disclosing systems for checking or correcting the spellings of input character strings are U.S. Pat. No. 4,674,066, U.S. Pat. No. 4,580,241, U.S. Pat. No. 4,498,148, U.S. Pat. No. 4,456,969, U.S. Pat. No. 4,383,307, U.S. Pat. No. 4,355,371, U.S. Pat. No. 4,136,395, and U.S. Pat. No. 3,969,698. Some of these systems address compressed or efficiently packed dictionary lookup tables in processing words which may be spelled incorrectly. Other prior art references describing compressed dictionary tables or packing techniques are U.S. Pat. No. 4,355,302, U.S. Pat. No. 4,342,085, U.S. Pat. No. 4,010,445 and U.S. Pat. No. 3,995,254.
The prior parsing systems have had the great disadvantages of requiring large amounts of computer memory and processing time for their operation, of failing to locate many common errors, and of flagging as errors many text passages that are in fact correct. These systems are incapable of achieving a first object of the present invention, i.e., on-the-fly operation in which the system makes as much progress as possible in parsing the input text as each character is input.
Another object of the present invention is to provide a parsing system which is operable both with text input from a keyboard, voice entry, or similar device and with text previously stored in a suitable memory.
A further object of the invention is to provide a parsing system which makes efficient use of program modules and lookup tables, permitting use with commonly available personal computers or word processing typewriters.
A still further object of the invention is to provide a parsing system having great flexibility, permitting processing of text written in one of a plurality of languages.