The present invention relates to identifying and retrieving text. More specifically, the present invention relates to identifying and retrieving text portions (or text fragments) of interest from a larger corpus of textual material by generating a graph covering the textual material and scoring portions of the graph.
There are a wide variety of applications that would benefit from the ability to identify text of interest in a larger text corpus. For instance, document clustering and document summarization both attempt to identify concepts associated with documents. Those concepts are used to group the documents into clusters or to summarize the documents. In fact, some attempts have been made to both cluster documents and automatically summarize an entire cluster of documents for use in later processing (such as information retrieval).
Prior systems have attempted to order sentences based on how related they are to the concept or subject of a document. The sentences are then compressed and sometimes slightly rewritten to obtain a summary.
In the past, sentence ordering has been attempted in a number of different ways. Some prior systems attempt to order sentences based on verb specificity. Other approaches have attempted to order sentences using heuristics that are based on the sentence position in the document and the frequency of entities identified in the sentence.
All such prior systems have certain disadvantages. For instance, all such prior systems are largely extractive: they simply extract words and sentence fragments from the documents being summarized, without changing the words or the word order. The words or sentence fragments are provided as the summary exactly as written in the original document and in the order in which they appear there. It can be difficult for humans to decipher the meaning of such text fragments.
In addition, most prior approaches have identified words or text fragments of interest by computing a score for each word in the text based on term frequency. The technique predominantly used in prior systems to compute such a score is the term frequency*inverse document frequency (tf*idf) function, which is well known and documented in the art. Some prior systems used minor variations of the tf*idf function, but all algorithms using the tf*idf class of functions are word-based.
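By way of illustration only, the word-based nature of tf*idf scoring can be seen in the following minimal sketch. It assumes one common variant of the function (raw term count multiplied by the natural logarithm of the number of documents divided by the number of documents containing the word); the function name tf_idf and the pre-tokenized input format are hypothetical and are not taken from any particular prior system.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score every word in every document (hypothetical helper).

    docs: a list of documents, each given as a list of word tokens.
    Returns one dict per document mapping word -> tf*idf score.
    Assumed variant: tf = raw count, idf = ln(N / document frequency).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in docs:
        term_freq = Counter(tokens)
        scores.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return scores
```

Under this variant, a word that occurs in every document receives a score of zero, while a word that is frequent in one document and rare elsewhere scores highest. Each score attaches to an individual word, with no account taken of how words relate to one another.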
In another area of technology, graphs have been built in order to rank web pages. The web pages are ranked using a hubs and authorities algorithm that uses the web pages as nodes in the graph and the hyperlinks between web pages as links in the graph. Such graphing algorithms have not been applied to graph text.
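For context, a hubs and authorities computation of the kind used for web page ranking can be sketched as follows. This is a simplified illustration under stated assumptions: the function name hits, the edge-list input, the fixed iteration count, and the normalization by Euclidean length are all illustrative choices rather than details of any specific prior system.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Hypothetical hubs and authorities sketch.

    edges: list of (source_page, target_page) pairs, one per hyperlink.
    Returns (hub_scores, authority_scores) as dicts keyed by page.
    """
    nodes = {page for edge in edges for page in edge}
    hub = {page: 1.0 for page in nodes}
    auth = {page: 1.0 for page in nodes}

    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        new_auth = {page: 0.0 for page in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]

        # A page's hub score is the sum of the authority scores of pages it links to.
        new_hub = {page: 0.0 for page in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]

        auth = _normalize(new_auth)
        hub = _normalize(new_hub)

    return hub, auth
```

For example, calling hits([("a", "c"), ("b", "c"), ("a", "d")]) gives page c a higher authority score than page d, because two pages link to c, and gives page a a higher hub score than page b, because a links to more authoritative pages.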