The present invention relates to a method of and an apparatus for processing an input text. The present invention also relates to a method of and an apparatus for performing an approximate translation. The invention further relates to a storage medium. Such methods and apparatuses may be used in natural language processing, document processing and text processing. For instance, such methods and apparatuses may be used as a glossing system which provides translations of words or groups of words in an input text into corresponding words or symbols or groups thereof in a different natural language.
Text in natural languages generally contains words or symbols which are associated with each other to have a meaning which is different from the individual meanings of the words or symbols. Such groups are referred to as “collocations” and must be identified as such if the text is to be processed correctly, for instance to access an index of a dictionary (monolingual, bilingual or multilingual), thesaurus or encyclopaedia.
There are known systems for analysing input text by parsing, i.e. analysing a sentence to determine the relationship between the words. The use of parsing is effective in optimally labelling a sentence with its collocations. However, this technique generally involves superfluous processing and is computationally complex. This technique also requires a vast amount of knowledge, e.g. grammar rules and the semantic constraints that related words exert upon each other, to drive it.
Another known technique finds the biggest continuous collocation, where “continuous” in this context means that the words of the collocation are adjacent to each other in the input text. However, such techniques cannot distinguish between collocations of the same length. For instance, in the sentence “Air passes out of the furnace through a pipe.”, there are two collocations each having two words, namely “passes out” and “out of”. This technique cannot decide which of these collocations should be chosen.
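The ambiguity described above can be illustrated with a minimal sketch. The lexicon `COLLOCATIONS` and the function `continuous_candidates` are hypothetical names introduced purely for illustration and are not part of the known technique as published.

```python
# Hypothetical toy lexicon of known collocations (an assumption for this sketch).
COLLOCATIONS = {("passes", "out"), ("out", "of")}

def continuous_candidates(words, lexicon):
    """Return (start, end) index pairs of adjacent word runs found in the lexicon."""
    found = []
    for start in range(len(words)):
        for end in range(start + 2, len(words) + 1):
            if tuple(words[start:end]) in lexicon:
                found.append((start, end))
    return found

sentence = "air passes out of the furnace through a pipe".split()
candidates = continuous_candidates(sentence, COLLOCATIONS)

# Keep only the longest candidates: two remain, and length alone cannot choose.
longest = max(end - start for start, end in candidates)
tied = [sentence[s:e] for s, e in candidates if e - s == longest]
print(tied)  # [['passes', 'out'], ['out', 'of']]
```

Both candidates have two words and share the word “out”, so a criterion based on length alone leaves the choice undetermined.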
A known technique for finding discontinuous collocations is disclosed in EP 0 637 805. This technique uses a part-of-speech tagger to attempt to select the best collocations from input text. Such a technique helps to distinguish between “bus stops”, where “stops” is a noun, and “stops at”, where “stops” is a verb, in the sentence “the bus stops at Grenoble”. However, this technique is not capable of indicating which of these possible collocations is optimal. Further, the technique does not provide a means for finding a consistent labelling of collocations for a sentence.
Although these techniques can determine without inconsistency collocations which do not share the same word from an input text, they cannot identify which is the optimal collocation where two or more possible collocations have one or more words in common. As the above examples illustrate, it is essential to select with a high degree of reliability the correct collocation if the collocation is required to be used, for instance to access an index such as a dictionary.
According to a first aspect of the invention, there is provided a method of processing an input text comprising a plurality of words, the method comprising the steps of:
deriving from the input text a plurality of sets such that each set comprises at least one of the words of the input text, all of the words of each set are present in the input text, and the words of each set, if any, containing more than one word constitute a collocation;
assigning to each set a unique relative rank;
comparing each set in order of decreasing relative rank with the input text; and
selecting each set, all of whose words are present in the input text and none of whose words are present in a previously selected set of higher relative rank.
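The comparing and selecting steps may be sketched as follows. This is only an illustration under stated assumptions: each set is represented by the positions of its words in the input text (so that repeated words remain distinct), the sets are supplied already in order of decreasing relative rank, and the name `select_sets` is hypothetical.

```python
def select_sets(num_words, ranked_sets):
    """Select sets in order of decreasing rank, skipping any that reuse a word.

    ranked_sets: iterable of sets of word positions, highest rank first
    (the ranking itself is assumed to have been assigned beforehand).
    """
    used = set()       # positions already claimed by a higher-ranked selected set
    selected = []
    for candidate in ranked_sets:
        in_text = all(0 <= i < num_words for i in candidate)
        if in_text and used.isdisjoint(candidate):
            selected.append(candidate)
            used |= candidate
    return selected

words = "air passes out of the furnace through a pipe".split()
# Assumed ranking: "out of" above "passes out", followed by the single words.
ranked = [{2, 3}, {1, 2}] + [{i} for i in range(len(words))]
chosen = select_sets(len(words), ranked)
print([[words[i] for i in sorted(s)] for s in chosen])
```

In this sketch “passes out” is rejected because the word “out” already belongs to the higher-ranked selected set “out of”, and the single-word sets fill in the remaining words.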
Each of the words of the input text may be present in at least one of the sets.
All of the words of the input text may be present in the union of the selected sets. The term “union” is used in its conventional mathematical sense and means a set containing all the words of the selected sets.
The input text may comprise a grammatically complete sample of text.
The words may comprise basic word forms derived from an original text by linguistic (e.g. morphological) analysis in a preliminary step.
The assigning step may comprise a first step of assigning a priority value which increases with increasing number of words in the set.
The assigning step may comprise a second step of assigning a priority value which decreases with increasing span of the words of the set in the input text. The term “span” means the number of words, including the words of the set themselves, between the word of the set which occurs first in the input text and the word of the set which occurs last in the input text.
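The definition of span given above may be expressed, as a minimal sketch, in terms of word positions. The function name `span`, the example sentence and the positional representation of a set are assumptions made for illustration.

```python
def span(positions):
    """Number of words from the first to the last word of the set, inclusive."""
    return max(positions) - min(positions) + 1

# Hypothetical example sentence, not taken from the source text.
words = "he took the matter into account".split()

discontinuous = {1, 4, 5}   # "took ... into account"
continuous = {4, 5}         # "into account"

print(span(discontinuous))  # 5
print(span(continuous))     # 2
```

A set whose words are adjacent has a span equal to its number of words, whereas a discontinuous set has a larger span and would therefore receive a lower priority value in the second step.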
The second step may be performed only if the first step results in more than one set having the same priority value.
The assigning step may comprise a third step of assigning a priority value which is dependent on the linguistic relationship between at least one word of the set and at least one word of the input text not in the set.
The third step may be performed only if the second step results in more than one set having the same priority value.
The assigning step may comprise a fourth step of assigning a priority value which increases with position to the right in the input text of the right-most word of the set. This is appropriate for languages such as English which tend to be right-branching.
The fourth step may be performed only if the third step results in more than one set having the same priority value.
The assigning step may comprise a fifth step of assigning a priority value by default.
The fifth step may be performed only if the fourth step results in more than one set having the same priority value.
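The cascade of first to fifth steps may be sketched as a lexicographic sort key, in which each later criterion matters only when all earlier criteria tie. The following is an illustration under assumptions, not a definitive implementation: sets are represented by word positions, the `linguistic_score` argument is a hypothetical placeholder for the third step, and the default of the fifth step is assumed here to be the order in which the candidates were supplied.

```python
def rank_key(positions, linguistic_score=0):
    """Tuple compared element by element; a larger tuple means a higher rank."""
    size = len(positions)                              # first step: more words is better
    spread = max(positions) - min(positions) + 1       # second step: smaller span is better
    rightmost = max(positions)                         # fourth step: further right is better
    return (size, -spread, linguistic_score, rightmost)

def rank_sets(candidates):
    """Return the candidates in order of decreasing, unique relative rank."""
    indexed = list(enumerate(candidates))
    # Fifth step (assumed default): an earlier candidate wins any remaining tie.
    indexed.sort(key=lambda item: (rank_key(item[1]), -item[0]), reverse=True)
    return [candidate for _, candidate in indexed]

# Positions in "air passes out of ...": passes=1, out=2, of=3.
candidates = [{1, 2}, {2, 3}, {1}, {2}, {3}]
ranked = rank_sets(candidates)
print(ranked)  # [{2, 3}, {1, 2}, {3}, {2}, {1}]
```

Here “passes out” and “out of” tie on number of words, span and the placeholder linguistic score, so the fourth criterion places “out of” first because its right-most word lies further to the right.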
The assigning step may comprise assigning a priority value based on a measure of probability of each set.
The method may comprise accessing an index of word sets with at least one of the selected sets.
According to a second aspect of the invention, there is provided a method of performing an approximate translation of an input text in a first natural language to a second natural language, comprising performing a method in accordance with the first aspect of the invention, in which the index is a dictionary, such as a bilingual dictionary, and outputting dictionary entries in the second language corresponding to the selected sets.
The first and second languages may be the same language but more usually are different languages.
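The approximate translation of the second aspect may be sketched as a dictionary look-up on each selected set. The dictionary contents, the function name `gloss` and the list of selected sets below are all hypothetical and serve only to illustrate the look-up; the selected sets are assumed to have been produced by a method according to the first aspect.

```python
# Hypothetical English-to-French entries, keyed by tuples of source words.
DICTIONARY = {
    ("out", "of"): "hors de",
    ("passes",): "passe",
    ("furnace",): "four",
    ("pipe",): "tuyau",
}

def gloss(words, selected_sets, dictionary):
    """Look up each selected set, in text order, and pair it with its entry."""
    glosses = []
    for positions in sorted(selected_sets, key=min):
        key = tuple(words[i] for i in sorted(positions))
        # A set with no entry yields None rather than a guessed translation.
        glosses.append((" ".join(key), dictionary.get(key)))
    return glosses

words = "air passes out of the furnace through a pipe".split()
selected = [{2, 3}, {1}, {5}, {8}]   # assumed output of the selecting step
result = gloss(words, selected, DICTIONARY)
print(result)
```

Because “out of” was selected as a single set, it is looked up as one entry rather than as two separate words.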
According to a third aspect of the invention, there is provided an apparatus for processing an input text comprising a plurality of words, the apparatus comprising:
means for deriving from the input text a plurality of sets such that each set comprises at least one of the words of the input text, all of the words of each set are present in the input text, and the words of each set, if any, containing more than one word constitute a collocation;
means for assigning to each set a unique relative rank;
means for comparing each set in order of decreasing relative rank with the input text; and
means for selecting each set, all of whose words are present in the input text and none of whose words is present in a previously selected set of higher relative rank.
The deriving means may be arranged such that each of the words of the input text is present in at least one of the sets.
The selecting means may be arranged such that all of the words of the input text are present in the union of the selected sets.
The input text may comprise a grammatically complete sample of text delimited by punctuation, such as full stops, semi-colons or colons. Examples of such samples are phrases, clauses and sentences.
The words may comprise basic word forms and the apparatus may comprise a linguistic analyser for analysing an original text and providing the basic word forms.
The assigning means may comprise first means for assigning a priority value which increases with increasing number of words in the set.
The assigning means may comprise second means for assigning a priority value which decreases with increasing span of the words of the set in the input text.
The second means may be enabled only if the first means assigns the same priority value to more than one set.
The assigning means may comprise third means for assigning a priority value which is dependent on the linguistic relationship between at least one word of the set and at least one word of the input text not in the set.
The third means may be enabled only if the second means assigns the same priority value to more than one set.
The assigning means may comprise fourth means for assigning a priority value which increases with position to the right in the input text of the right-most word of the set.
The fourth means may be enabled only if the third means assigns the same priority value to more than one set.
The assigning means may comprise fifth means for assigning a priority value by default.
The fifth means may be enabled only if the fourth means assigns the same priority value to more than one set.
The assigning means may be arranged to assign a priority value based on a measure of probability for each set.
The apparatus may comprise a store containing an index of word sets and means for accessing the index with at least one of the selected sets.
According to a fourth aspect of the invention, there is provided an apparatus for performing an approximate translation from an input text in a first natural language to a second natural language, comprising an apparatus in accordance with the third aspect of the invention, a store containing entries constituting a dictionary, such as a bilingual dictionary, and means for accessing the dictionary with at least one of the selected sets.
The apparatus according to the third or fourth aspect of the invention may comprise a programmed data processor.
According to a fifth aspect of the invention, there is provided a storage medium containing a program for a data processor of an apparatus according to the third or fourth aspect of the invention.
It is thus possible to provide a technique which allows optimal collocations to be selected. In the case where there are two or more candidates for the correct collocation and the candidates all contain the same word, this technique allows the correct candidate to be selected with improved reliability.
These methods are generally performed by, and these apparatuses are generally embodied by, a programmed data processor such as a computer. The technique is computationally economical and requires much less computing time and resources than the known parsing technique. For instance, this technique allows optimal collocation selection in a time of the order of n log n (where n is the number of equivalences before sorting as described hereinafter), whereas parsing requires a time of the order of n³. Although continuous collocation detection requires a time of the order of n, it cannot distinguish between collocations of the same length (as mentioned hereinbefore) and gives poor results.