1. Field of the Invention
This invention relates generally to text processing, and, more particularly, to graph-based ranking for text processing.
2. Description of the Related Art
Automated natural language processing techniques may be used to perform a variety of tasks, including word sense disambiguation, keyphrase extraction, sentence extraction, automatic summarization of text, and the like.
Word sense disambiguation is a technique for assigning the most appropriate meaning to a polysemous word within a given context. Word sense disambiguation is considered essential for applications that use knowledge of word meanings in open text, such as machine translation, knowledge acquisition, information retrieval, and information extraction. Accordingly, word sense disambiguation may be used by many commercial applications, such as automatic machine translation (e.g., the translation services offered by www.altavista.com and www.google.com), intelligent information retrieval (helping the users of search engines find information that is more relevant to their search), text classification, and others.
Conventional techniques for word sense disambiguation have concentrated on supervised learning, where each sense-tagged occurrence of a particular word is transformed into a feature vector, which is then used in an automatic learning process. However, the applicability of such supervised algorithms is limited to those few words for which sense-tagged data is available, and their accuracy is strongly connected to the amount of labeled data at hand. Open-text knowledge-based approaches for word sense disambiguation have received significantly less attention. While the performance of such knowledge-intensive methods is usually exceeded by their corpus-based alternatives, they nevertheless have the advantage of providing larger coverage. Knowledge-based methods for word sense disambiguation are usually applicable to all words in open text, while corpus-based techniques target only a few selected words for which large corpora are made available. The main types of knowledge-based methods that have been developed for word sense disambiguation include Lesk algorithms, semantic similarity, local context, selectional preference, and heuristic-based methods.
Keyphrase extraction may be used for automatic indexing (e.g., indexing terms for books, which may be much needed in libraries or by other cataloging services), terminology extraction, or as input to other applications that require knowledge of the important keywords in a text, e.g., word sense disambiguation or text classification. The task of a keyword extraction application is to automatically identify a set of terms that best describe a text. Such keywords may constitute useful entries for building an automatic index for a document collection, can be used to classify a text, or may serve as a concise summary for a given document. Moreover, a system for automatic identification of important terms in a text can be used for the problem of terminology extraction and for the construction of domain-specific dictionaries. The same algorithm can be applied for term extraction (e.g., to extract important terms in medical literature), or for producing short summaries of large texts.
One conventional technique for keyword extraction uses a frequency criterion to select the “important” keywords in a document. However, this method was generally found to lead to poor results, and consequently other methods were explored. Supervised learning methods, where a system is trained to recognize keywords in a text based on lexical and syntactic features, typically provide better results than the frequency criterion. One known supervised learning method is called GenEx. In this technique, parameterized heuristic rules are combined with a genetic algorithm to form a system for keyphrase extraction that automatically identifies keywords in a document. A learning algorithm that applies a Naive Bayes learning scheme to the document collection achieves improved results when applied to the same data set as used by the GenEx algorithm. A 29.0% precision is typically achieved with GenEx for five keyphrases extracted per document, and an 18.3% precision is achieved by the Naive Bayes learning scheme for fifteen keyphrases extracted per document.
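For illustration only, the frequency criterion and the precision measure mentioned above may be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce any particular prior system: the function names `frequency_keywords` and `precision`, the tokenization rule, and the small stopword list are all hypothetical choices made for this example.

```python
import re
from collections import Counter

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "by", "for", "in", "is",
    "of", "on", "that", "the", "to", "with",
})


def frequency_keywords(text, k=5):
    """Return the k most frequent non-stopword tokens in the text."""
    # Assumed tokenization: lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def precision(extracted, reference):
    """Fraction of extracted keywords that also appear in the reference set."""
    if not extracted:
        return 0.0
    reference = set(reference)
    return sum(1 for word in extracted if word in reference) / len(extracted)


if __name__ == "__main__":
    sample = ("Graph ranking ranks graph vertices. "
              "Ranking uses the graph structure of the text.")
    keywords = frequency_keywords(sample, k=2)
    print(keywords)                                # ['graph', 'ranking']
    print(precision(keywords, {"graph", "text"}))  # 0.5
```

In this sketch, a term is judged important purely by how often it occurs, which is one reason such a criterion may favor common words over terms that actually characterize the document.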
The performance of a supervised learning system can be improved by incorporating additional information or limiting the type of document. For example, when a supervised learning system is applied to keyword extraction from abstracts using a combination of lexical and syntactic features, accuracy may improve over previously published results. Keyword extraction from abstracts is more widely applicable than extraction from full texts, since many documents on the Internet are not available as full texts, but only as abstracts. Integrating part-of-speech information into the learning process may also improve the performance of supervised learning algorithms. The accuracy of the system may also be increased by adding linguistic knowledge to the term representation.
Various algorithms for sentence extraction and/or automatic summarization of text have also been proposed. With the huge amount of information now available, the task of automatic summarization is becoming increasingly important. Sentence extraction and/or automatic summarization may be of high interest for many companies or other agencies dealing with large amounts of data. For example, government agencies may use these techniques to summarize the huge volume of messages they receive daily. Search engines may use them to provide users with concise summaries of the documents found by user searches, and news agencies may use them to build abstracts for the everyday news.
Conventional natural language processing algorithms do not, however, utilize graph-based ranking algorithms, at least in part because of the difficulty of determining an appropriate graphing scheme.
The present invention is directed to addressing the effects of one or more of the problems set forth above.