The invention, in some embodiments, relates to understanding of the meaning of a written text, and more particularly to methods and systems which enable understanding of a text in a first language using a translation of the text into a second language.
The problem of understanding text by computers is well known. For example, full understanding of an input text is required in order for a computer to automatically generate an output text in another language, such that the output text has the exact same meaning as the input text, in a process known as “machine translation”. Another use for automatic understanding of a written text is the extraction of knowledge from the text, for example for generating recommendations of content to a TV subscriber based on textual summaries provided by content providers, which requires full understanding of the input text in order to correctly judge what content is expected to be of interest to the subscriber.
A naïve approach to the task of understanding text would be to look up each word of the text in a dictionary. Unfortunately, this approach often results in an inaccurate translation, as it disregards the existence of ambiguous words, which have multiple distinct meanings. For example, the word “numbered” has multiple meanings, including (i) marked by a number, as in “The pages of this book are numbered”, and (ii) limited in number, as in “The old man's days are numbered”. In some cases, the context of the full sentence in which the ambiguous word appears may indicate which of the multiple interpretations, or meanings, of the ambiguous word is the correct one for understanding the text. However, there are cases in which it is impossible to determine, based on the context, which of two meanings is the correct one. For example, in the sentence “The days of the paper calendar are numbered” it is impossible to tell whether the meaning is that the days of the paper calendar are marked by numbers, or that the days of the paper calendar are limited in number.
Additional examples of ambiguous English sentences, containing words having multiple meanings, which cannot be resolved from their local context, include: “Being in debt attracts a lot of interest from bankers”, “The invention of the wheel created a revolution”, and “Airlines process lost luggage on a case by case basis”. (All the above ambiguous sentences are taken from http://www.businessballs.com/puns-double-meanings.htm).
The problem of ambiguous words with multiple meanings is well known and many solutions for it have been proposed. Many of the proposed solutions rely on a statistical approach, in which a large amount of text is analyzed in advance to determine the relative frequency of each of the potential meanings of each ambiguous word, by counting occurrences of these meanings and storing the counts in a database. Subsequently, when the ambiguous word is encountered in a text to be understood, it is assumed that the meaning of the word is the most frequently appearing meaning according to the advance analysis, thus providing a high probability of selecting the correct meaning of the ambiguous word.
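The most-frequent-meaning approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not an implementation taken from any cited work: the function names `build_sense_counts` and `most_frequent_sense`, the sense labels, and the tiny hand-tagged corpus are all hypothetical and serve only to show the advance counting step and the later lookup step.

```python
from collections import Counter, defaultdict

def build_sense_counts(tagged_corpus):
    """Count how often each meaning of each word occurs.

    tagged_corpus: iterable of (word, sense) pairs (a hypothetical format
    standing in for a large, sense-annotated reference text).
    """
    counts = defaultdict(Counter)
    for word, sense in tagged_corpus:
        counts[word][sense] += 1
    return counts

def most_frequent_sense(counts, word):
    """Return the most frequently observed meaning of word, or None if unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

# Made-up toy data, for illustration only.
corpus = [
    ("numbered", "marked_by_number"),
    ("numbered", "marked_by_number"),
    ("numbered", "limited_in_number"),
]
counts = build_sense_counts(corpus)
print(most_frequent_sense(counts, "numbered"))  # marked_by_number
```

Note that such a lookup ignores the sentence entirely: with these toy counts it would return "marked by a number" even for “The old man's days are numbered”.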
An improvement to the statistical approach described above, which adds context analysis to the statistical approach, is disclosed in http://anthology.aclweb.org/P/P91/P91-1017.pdf. In accordance with this method, rather than counting occurrences of single words, the improved method counts occurrences of pairs of words that appear to be related to each other in the analyzed text, and then selects the most frequent interpretation for the pair of words. In the example provided above, of “The days of the paper calendar are numbered”, one may count occurrences of the pair {days/numbered} in which the word “numbered” is used with the meaning of “marked by a number” and occurrences in which the word “numbered” is used with the meaning of “limited in number”. Most probably, one would find that when combined with the word “days”, the word “numbered” is more frequently used to mean “limited in number” than “marked by a number”, and therefore the interpretation “limited in number” would be selected.
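The pair-counting idea can be sketched in the same spirit. The code below is only an illustration under stated assumptions, not the method of the cited article: the names `build_pair_counts` and `sense_for_pair`, the triple format, and the counts are hypothetical, and the step of deciding which words in a sentence are "related" is assumed to have been done already.

```python
from collections import Counter

SENSES = ("marked_by_number", "limited_in_number")  # hypothetical sense labels

def build_pair_counts(tagged_pairs):
    """Count (context_word, ambiguous_word, sense) triples seen in reference text."""
    return Counter(tagged_pairs)

def sense_for_pair(pair_counts, context_word, ambiguous_word, senses):
    """Pick the sense seen most often for this word pair, or None if never seen."""
    scored = {s: pair_counts[(context_word, ambiguous_word, s)] for s in senses}
    best = max(scored, key=scored.get)
    return best if scored[best] > 0 else None

# Made-up toy data, for illustration only.
pair_counts = build_pair_counts([
    ("days", "numbered", "limited_in_number"),
    ("days", "numbered", "limited_in_number"),
    ("days", "numbered", "limited_in_number"),
    ("days", "numbered", "marked_by_number"),
    ("pages", "numbered", "marked_by_number"),
    ("pages", "numbered", "marked_by_number"),
])
print(sense_for_pair(pair_counts, "days", "numbered", SENSES))   # limited_in_number
print(sense_for_pair(pair_counts, "pages", "numbered", SENSES))  # marked_by_number
```

The table is keyed by word pairs rather than single words, so the number of entries that must be collected and stored grows accordingly.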
An additional improvement disclosed in the article mentioned above is to apply the statistical test, not (only) in the language of the analyzed text, but (also) in a second language in which the analyzed word is not ambiguous, and has a single meaning. For example, even though the word “numbered” is ambiguous in English and corresponds to two different meanings, in Hebrew those two meanings are each associated with a different word: “ממוספרים” (pronounced meh-moos-pah-REEM) for “marked by a number” and “ספורים” (pronounced sfoo-REEM) for “limited in number”. In accordance with this improved method, the occurrences of each of the two Hebrew words (or pairs of words) are counted over a large amount of Hebrew text, and it is assumed that the resulting probabilities measured on the Hebrew text are reliable approximations of the corresponding probabilities of the multiple meanings of the ambiguous word in English. The authors of the article claim that counting the likelihood of a word meaning, or of the meaning of a pair of words, in a different language may provide better results than counting the likelihood of a corresponding word meaning in the language of the original text.
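The second-language counting step can be illustrated with a short sketch. Everything named here is an assumption made for illustration, not the procedure of the cited article: the function `sense_from_second_language`, the sense labels, the romanized spellings `memusparim` and `sfurim` used in place of the two Hebrew words, the filler token `yamim`, and the toy token list.

```python
from collections import Counter

# Hypothetical mapping from each meaning of the ambiguous English word to the
# unambiguous second-language word (romanized here for readability).
SENSE_TO_TARGET = {
    "marked_by_number": "memusparim",
    "limited_in_number": "sfurim",
}

def sense_from_second_language(target_tokens, sense_to_target):
    """Estimate sense probabilities from word counts in second-language text.

    Returns (best_sense, probabilities), or (None, {}) if none of the
    second-language words occurs in the text.
    """
    counts = Counter(target_tokens)
    total = sum(counts[w] for w in sense_to_target.values())
    if total == 0:
        return None, {}
    probs = {s: counts[w] / total for s, w in sense_to_target.items()}
    best = max(probs, key=probs.get)
    return best, probs

# Made-up toy token stream standing in for a large second-language corpus.
tokens = ["sfurim", "yamim", "sfurim", "memusparim", "sfurim"]
best, probs = sense_from_second_language(tokens, SENSE_TO_TARGET)
print(best, probs["limited_in_number"])  # limited_in_number 0.75
```

Because each second-language word carries only one of the meanings, plain word counts suffice here and no sense-annotated reference text is needed.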
However, although all the described solutions for determining the meaning of an ambiguous word succeed in increasing the likelihood of arriving at the correct interpretation of the word, they are far from providing 100% accurate understanding of the text. Additionally, the more successful the solution, the more complex, time consuming, and expensive it is to implement. For example, methods that rely on statistics of pairs of words require a much more extensive analysis of a large amount of reference text, and a much higher number of entries for which occurrence data must be stored, than methods that rely on statistics of single words.
There is therefore a need in the art for a method for determining the meaning of a text including ambiguous words at a high accuracy rate, while being simple, robust, fast, and inexpensive to implement.