Field of the Invention
Embodiments of the present invention relate generally to computer science and, more specifically, to techniques for interpreting text based on what words and phrases mean in context.
Description of the Related Art
Natural language processing is essential for intelligent analysis of text. A wide variety of text mining applications (e.g., searching, document correlation, summarization, translation, etc.) rely on identifying meaningful patterns in text segments. In particular, targeting content to match user preferences often leverages pattern analysis of web page accesses and searches previously performed by the user.
In one approach to identifying patterns in text segments, key-word matching algorithms conflate words (sequences of characters) with meaning, ignoring the ambiguous and context-dependent nature of language. For example, the word “chair” has multiple word senses (i.e., meanings), including “a position of a professor in an academic institution” and “a seat for one person with a support for the back.” Because key-word-based algorithms do not differentiate between word senses, such algorithms often misinterpret text segments and produce irrelevant results. For instance, a document comparison algorithm based on key-word matching might conclude that a user manual for debugging a mouse (input device) and a description of trapping a mouse (rodent) are relatively similar.
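The conflation described above can be illustrated with a minimal sketch. The sketch assumes a simple Jaccard overlap of word sets as the key-word matching measure; the function names and the sample sentences are hypothetical and are not taken from any particular system.

```python
import re


def keywords(text: str) -> set:
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_overlap(a: str, b: str) -> float:
    """Jaccard overlap of two key-word sets: |A & B| / |A | B|."""
    ka, kb = keywords(a), keywords(b)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


# Hypothetical text segments: two use "mouse" in different senses.
manual = "how to fix a mouse that stops moving"       # input device
trapping = "how to trap a mouse that keeps moving"    # rodent
unrelated = "recipe for baking sourdough bread"

# The two "mouse" segments share most of their key words, so the measure
# rates them as similar even though the word senses differ.
print(keyword_overlap(manual, trapping))
print(keyword_overlap(manual, unrelated))
```

Because the measure sees only character sequences, nothing in it can separate the input-device sense from the rodent sense.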
In an effort to increase the reliability of text interpretation techniques, other approaches leverage statistical analysis algorithms to “guess” word senses. In one technique, applications select word senses based on the frequency distribution of the words in the text segment. While such an approach often produces more sensible results than a purely key-word-based approach, statistical analysis is unreliable, particularly across genres and domains. For instance, in electrical engineering textbooks, the frequency of the word-meaning combination “resistance: opposition to the flow of electrical current” is much higher than the frequency of the word-meaning combination “resistance: the attempt to prevent something,” whereas the latter combination is prevalent in history texts. A more serious shortcoming of statistical techniques is that the probabilities become less reliable and less discriminating when hundreds of millions of pages across all domains are processed and indexed.
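The domain sensitivity described above can be sketched with a most-frequent-sense selector, which is one simple statistical stand-in and not necessarily the technique used by any given application. The sense counts below are invented for illustration only, and all names are hypothetical.

```python
from collections import Counter

ELECTRICAL = "opposition to the flow of electrical current"
PREVENT = "the attempt to prevent something"

# Hypothetical sense-tagged counts for "resistance" per domain
# (made-up numbers, used only to illustrate the effect).
SENSE_COUNTS = {
    "electrical_engineering": Counter({ELECTRICAL: 95, PREVENT: 5}),
    "history": Counter({ELECTRICAL: 3, PREVENT: 97}),
}


def most_frequent_sense(counts: Counter) -> str:
    """Select the sense with the highest observed count."""
    return counts.most_common(1)[0][0]


def top_sense_probability(counts: Counter) -> float:
    """Relative frequency of the most frequent sense."""
    return counts.most_common(1)[0][1] / sum(counts.values())


# Statistics gathered in one domain select the wrong sense in the other.
print(most_frequent_sense(SENSE_COUNTS["electrical_engineering"]))
print(most_frequent_sense(SENSE_COUNTS["history"]))

# Pooling counts across domains flattens the distribution, so the
# probabilities discriminate far less between the two senses.
pooled = sum(SENSE_COUNTS.values(), Counter())
print(top_sense_probability(SENSE_COUNTS["electrical_engineering"]))
print(top_sense_probability(pooled))
```

With these assumed counts, the top sense accounts for 95 percent of occurrences within the electrical engineering domain but only about half of the occurrences once the domains are pooled.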
As the foregoing illustrates, what is needed in the art are more effective techniques for interpreting text segments in computer-based implementations where word and phrase meanings are important.