The present invention generally relates to a method and system for generating a lexicon of cooccurrence relations in a natural language. More particularly, the present invention concerns technology for generating and maintaining a cooccurrence relation lexicon which describes cooccurrence relations among words, phrases and the like, and which can be utilized not only in a natural language parsing system for analyzing sentences or clauses described in a natural language but also in a translation system for performing translation between different natural languages on the basis of the results of the parsing.
As apparatus and systems for parsing sentences and clauses described in a natural language and making use of the results of the parsing for translation or for other purposes, there have heretofore been developed question-answer systems, automatic indexing systems and machine translation systems which operate on natural languages. In this field of technology, the main theme of study has been parsing for the recognition of sentences or clauses. In a simple form of parsing, a template sentence, or a semi-template sentence, i.e. a template sentence having a variable such as, for example, "PLEASE GIVE ME * TICKETS" (where * represents a variable indicating the number of tickets in this example), is collated with an input sentence, and detection of coincidence between the template or semi-template sentence and the input sentence allows an output such as "INPUT SENTENCE COULD BE RECOGNIZED" to be issued. In syntactic analysis, in which a more general parsing method is adopted, the subject, predicate, modifying phrases and other constituents of a sentence are recognized.
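The collation of a semi-template sentence with an input sentence may be pictured by the following minimal sketch. It is an illustration only and is not taken from any of the systems mentioned; the function names `template_to_regex` and `match_template`, and the choice to let the variable `*` stand for exactly one whitespace-delimited token, are assumptions made for this example.

```python
import re

def template_to_regex(template: str) -> re.Pattern:
    # Split on the variable marker "*", escape the literal parts, and join
    # them with a capture group that matches one whitespace-free token.
    parts = [re.escape(p) for p in template.split("*")]
    return re.compile(r"(\S+)".join(parts), re.IGNORECASE)

def match_template(template: str, sentence: str):
    """Return the list of variable fillers, or None if there is no match."""
    m = template_to_regex(template).fullmatch(sentence.strip())
    return list(m.groups()) if m else None

TEMPLATE = "PLEASE GIVE ME * TICKETS"

if __name__ == "__main__":
    for sentence in ("please give me 3 tickets",
                     "please give me the tickets now"):
        fillers = match_template(TEMPLATE, sentence)
        if fillers is None:
            print("INPUT SENTENCE COULD NOT BE RECOGNIZED")
        else:
            print("INPUT SENTENCE COULD BE RECOGNIZED, variable =", fillers)
```

Such collation recognizes only sentences that coincide with a stored template, which is why the more general syntactic analysis described above is needed for unrestricted input.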
In the syntactic analysis mentioned above, difficulty is encountered in parsing a partially blank sentence having a portion to be filled in such as, for example, "SOMEBODY SAID THAT . . . ". Similarly, in the parsing of a sentence containing a plurality of modifiers, it is extremely difficult to determine what a given word, phrase or clause modifies. In conjunction with the parsing of an English sentence, for example, it is known that a sentence composed of a subject, a predicate and an object admits five alternative candidate parses when two prepositional phrases are added and as many as fourteen when three are added. For avoiding the ambiguity thus involved, it has been proposed that semantic restrictions should be imposed on the parsing. By way of example, consider the phrase "A BUILDING OF WHITE WALL STANDING BY A LAKE". As one hypothesis, this phrase may be syntactically analyzed into a string of words "WHITE WALL STANDS BY A LAKE" with the word "BUILDING" attached thereto. To exclude such a hypothesis, a semantic restriction rule to the effect that "MATERIAL (white wall) OF `OF MATERIAL` CAN NOT BE THE SUBJECT OF THE POSSESSIVE CASE" or, alternatively, a word-based selectional (restriction) rule to the effect that "WHITE WALL CAN NOT STAND" but "BUILDING CAN STAND" may be established. Under such a restriction, the above phrase is syntactically analyzed to read "(BUILDING OF WHITE WALL) STANDING (BY A LAKE)".
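The way in which a word-based selectional rule excludes a parsing hypothesis may be sketched as follows. The table `CAN_BE_SUBJECT_OF`, its entries, and the dictionary layout of each hypothesis are hypothetical and chosen solely for this illustration; an actual parser would derive the subject and verb of each hypothesis from its own parse structures.

```python
# Hypothetical selectional-restriction table: for each verb, the nouns that
# are permitted as its subject. The entries are illustrative assumptions.
CAN_BE_SUBJECT_OF = {
    "STAND": {"BUILDING", "TOWER", "PERSON"},
}

def allowed(subject: str, verb: str) -> bool:
    """True if the table permits `subject` as the subject of `verb`."""
    return subject in CAN_BE_SUBJECT_OF.get(verb, set())

def filter_hypotheses(hypotheses):
    """Keep only hypotheses whose (subject, verb) pair is permitted."""
    return [h for h in hypotheses if allowed(h["subject"], h["verb"])]

# Two candidate analyses of "A BUILDING OF WHITE WALL STANDING BY A LAKE".
hypotheses = [
    {"bracketing": "BUILDING OF (WHITE WALL STANDING BY A LAKE)",
     "subject": "WHITE WALL", "verb": "STAND"},
    {"bracketing": "(BUILDING OF WHITE WALL) STANDING (BY A LAKE)",
     "subject": "BUILDING", "verb": "STAND"},
]

if __name__ == "__main__":
    for h in filter_hypotheses(hypotheses):
        print(h["bracketing"])
```

Only the hypothesis in which "BUILDING" is the subject of "STAND" survives, because "WHITE WALL" is absent from the table entry for that verb.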
In this connection, it is observed that a certain word occurs in a sentence together with another certain word in a certain relationship with a high probability or high frequency. In that case, the two words can be said to share a cooccurrence relation with each other. As instances exemplifying the cooccurrence relation, there may be mentioned English idioms such as "TAKE A BATH", the government of prepositions by verbs typified by "GET OUT", adverbial concord or collocation such as ". . . NOT . . . AT ALL", and others. As literature describing these cooccurrence relations in linguistic detail, there exist dictionaries of collocations. For example, reference may be made to S. Katsumata's "Kenkyusha's New Dictionary Of English Collocations" (1958, Second Edition) and "Longman Dictionary of English Idioms". These dictionaries are, however, intended for use by people having knowledge and experience in various fields in addition to linguistics. Further, these dictionaries simply enumerate fragmentary instances in a certain sequence. In other words, the dictionaries can not be straightforwardly utilized for setting up rules useful in syntactic analysis or parsing.
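The notion of two words occurring together with high frequency may be made concrete by the following minimal sketch, which merely counts how many sentences of a small sample contain both words of a pair. It is a simplification: the relationship in which the two words stand to each other is ignored, and the function name `cooccurrence_counts` and the sample sentences are assumptions introduced for this illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count, for every unordered word pair, the sentences containing both."""
    counts = Counter()
    for sentence in sentences:
        # Upper-case, deduplicate and sort so that each pair is counted once
        # per sentence and always appears in the same (alphabetical) order.
        words = sorted(set(sentence.upper().split()))
        counts.update(combinations(words, 2))
    return counts

corpus = [
    "I take a bath every night",
    "She will take a bath",
    "Take a seat",
]
counts = cooccurrence_counts(corpus)

if __name__ == "__main__":
    print(counts[("BATH", "TAKE")])   # sentences containing both words
    print(counts[("SEAT", "TAKE")])
```

A raw count of this kind is only a starting point, since frequent function words such as "A" cooccur with nearly everything.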
With a view toward making linguistic knowledge available for machine processing such as parsing, the formatting of such knowledge in the form of tables and rules has been developed and proposed. Further, as an aid to this end, a method for analyzing or extracting the cooccurrence relations has been proposed according to which a set of sentences each including a word of interest is outputted in the form of a list in order to determine or check how the word of interest is used in the sentences. Such a method is known as the KWIC (Key Word in Context) method. However, even with the aid of the KWIC method, a test as to whether the restriction rules and grammar are observed can not be made without resorting to the user's judgment.
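A KWIC listing of the kind described above may be sketched as follows. This is a minimal illustration under stated assumptions: the function name `kwic`, the context width of three words and the sample sentences are hypothetical, and tokenization is by whitespace only.

```python
def kwic(sentences, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    key = keyword.upper()
    lines = []
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token.upper() == key:
                # Up to `width` words on either side of the keyword.
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, token, right))
    return lines

sample = ["She will take a bath", "Take a seat"]

if __name__ == "__main__":
    for left, word, right in kwic(sample, "take"):
        # Right-align the left context so that the keywords line up.
        print(f"{left:>20}  {word}  {right}")
```

The listing shows each use of the word in its context, but deciding whether a given use obeys a restriction rule is still left to the person reading it.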
In conjunction with the procedure of regulating the cooccurrence relations for utilization in syntactic analysis or parsing, it is required to determine in advance what types of cooccurrence relations are to be set up (usually in tabulated form) and where and how a given cooccurrence relation is to be made use of in the course of the parsing. As a consequence, the parsing process assumes a fixed routine which lacks flexibility and gives rise to problems. Further, because data or information on the cooccurrence relations is available only through the medium of the recorded tables, a situation may occur in which information required for a given parsing is not available. In that case, the preparation of information requisite for the establishment of new cooccurrence relations, as well as addition to, deletion from and modification of the cooccurrence relation table, must rely on manual labor, which requires a number of laborious processing steps.
As known literature concerning machine translation in which a lexicon of cooccurrence relations is made use of, there may be mentioned, for example, Muraki et al., "Semantic Processing in Machine Translation System Using PROLOG" contained in "Natural Language Processing Study Reports 33-5" published by Information Processing Society of Japan (Oct. 22, 1982) and Pierre Isabelle et al., "TAUM-AVIATION: Its Technical Features" in Computational Linguistics, Vol. 11, No. 1, January-March 1985, pp. +18.