Field of the Invention
The present invention relates generally to computational linguistics and, more specifically, to summarizing collections of documents.
Description of the Related Art
People often wish to summarize information that is contained in, and distributed among, relatively large collections of documents, e.g., collections with substantially more documents than they have time to read or the cognitive capacity to analyze. The goal of summarization is to find a representative subset of the data that captures the information of the entire set. Automatic document summarization can be implemented as a process of selecting a few representative features from among the many features expressed by documents in the collection. By selecting only a few representative features, summarization allows a user to develop insights about the information contained in the documents without manually digesting all of that information, which is typically far more voluminous than the summary.
However, many existing document summarization techniques produce results of poor quality. For example, some techniques applied to text documents select the words with the highest frequencies across the documents in the collection. Such techniques are likely to select common words that carry little semantic information about the collection, such as “the,” “and,” and “or.” Other techniques applied to text documents compare word frequencies within the collection to word frequencies outside of the collection, in order to select words that are over-represented within the collection. Those techniques tend to favor excessively rare words that are either highly specialized or so uncommon that their occurrence can be attributed to chance, and that, in either case, are not representative of the collection of documents at large.
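By way of illustration only, the two conventional approaches described above might be sketched in Python as follows. The function names, the whitespace tokenization, and the add-one smoothing are assumptions made for this sketch and are not taken from any particular prior system; the toy documents are hypothetical.

```python
from collections import Counter


def count_words(docs):
    """Count lowercase, whitespace-delimited tokens over a list of strings."""
    return Counter(word for doc in docs for word in doc.lower().split())


def top_frequency_terms(docs, k):
    """Baseline 1: pick the k most frequent words in the collection."""
    return [word for word, _ in count_words(docs).most_common(k)]


def overrepresented_terms(docs, background_docs, k):
    """Baseline 2: pick the k words whose relative frequency in the
    collection most exceeds their relative frequency in a background set.

    Add-one smoothing (an assumption of this sketch) avoids division by
    zero for words that never occur in the background set.
    """
    fg = count_words(docs)
    bg = count_words(background_docs)
    vocab_size = len(set(fg) | set(bg))
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())

    def ratio(word):
        fg_rate = fg[word] / fg_total
        bg_rate = (bg[word] + 1) / (bg_total + vocab_size)
        return fg_rate / bg_rate

    return sorted(fg, key=ratio, reverse=True)[:k]


# Hypothetical toy data for illustration.
collection = [
    "the cat and the dog",
    "the bird and the cat",
    "the zyzzyva or the cat",
]
background = [
    "the cat and the dog or the cat",
    "the bird and the cat or the dog",
]

most_frequent = top_frequency_terms(collection, 1)
most_overrepresented = overrepresented_terms(collection, background, 1)
print(most_frequent, most_overrepresented)
```

On this toy data, the frequency baseline selects the stop word "the," while the over-representation baseline selects "zyzzyva," a word that appears only once in the collection. Both outcomes mirror the shortcomings noted above.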