A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention relates to automatic processing of human language. More particularly, the present invention relates to systems and methods for automatically determining the semantic similarity of different sentences to one another.
Consider a sentence-matching machine that can accept an input sentence of human language (for example, a sentence of Chinese-language text) and then automatically find one or more sentences, from a database of sentences, that most closely match the input sentence in some sense. Such a sentence-matching machine would be useful. A lighthearted example of a desired use for a sentence-matching machine is as follows. A person expresses an interesting thought in a sentence and then inputs that sentence into the sentence-matching machine. The person wishes the machine to compare the sentence with sentences from a book of famous quotations to see whether anyone famous has previously expressed the same or a similar thought in a sentence quoted in the book.
Certain types of sentence-matching machines do exist. Unfortunately, the conventional sentence-matching machine is not well suited to perform the task sought by the person in the above-mentioned example. The reason is that different people, or even the same person on different occasions, are likely to express any given idea using different sets of words. The conventional sentence-matching machine would not be good at recognizing semantically similar sentences if the sentences use sufficiently non-overlapping sets of words.
What is needed is a sentence-matching system and associated methods that can make good judgments of semantic similarity among word sets, e.g., sentences, even when semantically similar word sets do not necessarily have words in common. What is especially needed is such a sentence-matching system and associated methods that are operative on word sets that include words of Chinese or similar languages. The present invention satisfies these and other needs.
A system and associated methods determine the semantic similarity of different sentences to one another. A particularly appropriate application of the present invention is to automatic processing of Chinese-language text, for example, for document retrieval.
According to one aspect of the invention, a method for computing the similarity between a first set of words and a second set of words includes identifying a word of the second set of words as being most similar to a word of the first set of words, wherein the word of the second set of words need not be identical to the word of the first set of words; and computing a score of the similarity between the first and second sets of words based at least in part on the word of the second set of words.
According to another aspect of the invention, a system for computing the similarity between a first set of words and a second set of words includes means for identifying a word of the second set of words as being most similar to a word of the first set of words, wherein the word of the second set of words need not be identical to the word of the first set of words; and means for computing a score of the similarity between the first and second sets of words based at least in part on the word of the second set of words.
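The two steps recited above (identifying the most similar word, then computing a set-level score from it) might be sketched as follows. This is only an illustrative sketch under stated assumptions, not the disclosed implementation: the word-to-word similarity table `TOY_SIMILARITY`, the function names `word_similarity`, `best_match`, and `set_similarity`, and the choice to average the best-match scores are all hypothetical and are not taken from the disclosure.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical toy similarity table with made-up values. A real system would
# obtain word-to-word similarity from some lexical resource.
TOY_SIMILARITY = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("fast", "quick")): 0.8,
}


def word_similarity(a: str, b: str) -> float:
    """Return an assumed similarity in [0, 1] between two words."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def best_match(
    word: str,
    candidates: Iterable[str],
    sim: Callable[[str, str], float] = word_similarity,
) -> Tuple[Optional[str], float]:
    """Identify the candidate word most similar to `word`.

    The match need not be identical to `word`.
    """
    candidates = list(candidates)
    if not candidates:
        return None, 0.0
    best = max(candidates, key=lambda c: sim(word, c))
    return best, sim(word, best)


def set_similarity(
    first: Iterable[str],
    second: Iterable[str],
    sim: Callable[[str, str], float] = word_similarity,
) -> float:
    """Score two word sets from the best match of each word of the first set.

    Averaging the best-match scores is an assumption made for illustration.
    """
    first, second = list(first), list(second)
    if not first or not second:
        return 0.0
    total = sum(best_match(w, second, sim)[1] for w in first)
    return total / len(first)


if __name__ == "__main__":
    # The two word sets share no words, yet receive a nonzero score.
    print(set_similarity(["fast", "car"], ["quick", "automobile"]))
```

In this sketch, word sets with no words in common can still score well, because each word of the first set is scored against its closest counterpart in the second set instead of requiring an exact match.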