The present invention is directed to a system and method for determining a relationship (e.g., similarity in meaning) between two portions of text. More specifically, the present invention is related to matching a textual input to a portion of a lexical knowledge base and exploiting the results of that match.
A number of application areas currently being developed attempt to “understand” the meaning of a textual input. Examples of such areas include information retrieval, machine translation, conversational and user interfaces, and natural language processing. For instance, when a user of an information retrieval system enters a query, it is helpful for the information retrieval system not only to match words in the query exactly against words in the database being searched, but also to understand the meaning of the query so that it can identify documents that are related to that meaning but that do not necessarily use the exact same words.
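The limitation of exact word matching described above can be illustrated with a minimal sketch. The function name `exact_match` and the sample documents are hypothetical and chosen only for illustration; they do not describe any particular retrieval system.

```python
def exact_match(query, documents):
    """Return the documents that contain every word of the query verbatim."""
    query_words = set(query.lower().split())
    return [doc for doc in documents if query_words <= set(doc.lower().split())]


# Hypothetical sample data: both documents concern the same topic,
# but only the first one shares its words with the query.
documents = [
    "how to repair a car engine",
    "fixing an automobile motor at home",
]
hits = exact_match("repair car", documents)
```

Under these assumptions, the second document is not returned even though it is related in meaning to the query, because it shares no words with it.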
Similarly, in machine translation systems, some words or phrases do not translate exactly from one language to another. In those instances, it would be helpful for the translation system to understand the meaning of the text to be translated such that the translation can be accomplished more accurately.
In conversational or user interfaces, some computers attempt to receive a natural language input (either spoken, in which case a speech recognition component is also utilized, or in written or typed form), recognize that input, and take a designated action based on the input. Of course, different users can provide a wide variety of inputs, all of which mean the same thing. Therefore, this and many other applications for natural language processing techniques can benefit from understanding the meaning of a textual input.
While understanding the meaning of a textual input is highly beneficial, the industry has encountered a number of problems in attempting to develop such systems.
The first problem is related to paraphrase identification. Systems have great difficulty in understanding the input text well enough to recognize paraphrase relationships between text segments that “mean the same thing” even though they may not precisely share content words or syntactic structures.
A problem which is closely related to paraphrase identification is that of word sense disambiguation. Nearly all words are polysemous, to some extent, in that they have different shades of meaning, depending on the context in which they are used. These different shades of meaning are often aligned on a continuum so that one shade of meaning blends into another, based on subtle shifts in context. Word sense disambiguation involves the assignment of one or more senses to an ambiguous word in an utterance. The traditional approach to word sense disambiguation is to create a number of “buckets” for each polysemous word, each bucket labeled with an individual sense number. Once the buckets are created, the analyzer attempts to place the word in the textual input in a given bucket, depending on a number of predetermined criteria. This strategy encounters great difficulty in applications such as information retrieval and machine translation, both because (1) word uses do not always conform to these discrete bins, and (2) state-of-the-art disambiguation techniques are quite unreliable.
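The traditional “bucket” approach described above can be sketched as follows. This is a simplified illustration only: the sense inventory `SENSE_BUCKETS`, its cue words, and the overlap-counting criterion in `assign_sense` are assumptions made for this example, not a description of any actual disambiguation system.

```python
# Hypothetical sense inventory: each polysemous word has numbered buckets,
# and each bucket is described here by a set of cue words (an assumption).
SENSE_BUCKETS = {
    "bank": {
        1: {"money", "deposit", "loan", "account"},   # financial institution
        2: {"river", "water", "shore", "fishing"},    # edge of a river
    }
}


def assign_sense(word, context):
    """Place the word in the bucket whose cue words overlap the context most.

    Returns the sense number, or None when no bucket matches at all.
    """
    context_words = set(context.lower().split())
    scores = {
        sense: len(cues & context_words)
        for sense, cues in SENSE_BUCKETS[word].items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


financial = assign_sense("bank", "she opened an account at the bank")
riverside = assign_sense("bank", "he sat on the bank fishing")
unmatched = assign_sense("bank", "the blood bank needs donors")
```

In this sketch the third use of the word fits neither predefined bucket, which illustrates the first difficulty noted above: word uses do not always conform to discrete bins.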
Another related problem is the requirement that software application developers be allowed to easily customize their applications to accept natural language inputs in order to implement a conversational interface. Traditionally, such developers have simply attempted to think of every possible way that a user may specify or request an application function. When a user input is received, the system attempts to match the user input against one of the possibilities generated by the developer. The problem of mapping natural language inputs to program functions is a fundamental one. However, despite its importance, developers cannot exhaustively specify the wide range of utterances users might use to command the application, or how these utterances should map to function calls.
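The enumeration strategy described above can be sketched as a simple lookup table. The table `COMMAND_PHRASES`, the function names it maps to, and the `dispatch` helper are hypothetical and serve only to illustrate the approach.

```python
# Hypothetical table of phrasings anticipated by the developer, each mapped
# to the name of the application function it should trigger.
COMMAND_PHRASES = {
    "open the file": "open_file",
    "open file": "open_file",
    "save the file": "save_file",
}


def dispatch(utterance):
    """Return the function name for an anticipated phrasing, else None."""
    return COMMAND_PHRASES.get(utterance.lower().strip())


anticipated = dispatch("Open the file")
unanticipated = dispatch("could you load my document")
```

Under these assumptions, a request phrased in a way the developer did not anticipate yields no match, even though it asks for the same function.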
Some methodologies which have been used in the past for paraphrase identification include dictionary-based, example-based, and corpus-based systems. Dictionary-based work has focused mainly on the creation of lexical knowledge bases, specifically taxonomies, from machine-readable dictionaries. The aim of most example-based research has been to create large example bases, or collections of example relations or phrases, and to develop methods for matching incoming text to the stored examples. Corpus-based research efforts have used quantitative analyses of text corpora to develop statistical models of the relationships between words, including simple co-occurrence as well as deeper similarity relationships.
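The simple co-occurrence statistic mentioned above for corpus-based work can be sketched as follows. This is a minimal illustration that counts sentence-level co-occurrence of word pairs; the function name `cooccurrence_counts` and the three-sentence corpus are assumptions made for this example, and actual corpus-based systems may use different windows, weighting, or similarity measures.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count, for each pair of distinct words, the sentences containing both.

    Each pair is stored as an alphabetically ordered tuple.
    """
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


# Hypothetical toy corpus.
corpus = [
    "the doctor treated the patient",
    "the nurse helped the patient",
    "the doctor met the nurse",
]
counts = cooccurrence_counts(corpus)
```

Counts of this kind form the raw material from which a statistical model of word relationships could be estimated.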