The present invention relates to the field of document management, and more particularly, to a system for summarizing documents that uses information about a document's genre, or document type, for selecting summary sentences for an automatically generated summary.
A user faced with a huge document or a collection of documents typically wants to obtain a summary of the documents in order to save time or to answer a specific question. The task of summarizing a document involves finding a small number of sentences that provide a concise characterization of the document. Existing approaches for summarizing documents apply only one summarization strategy, thus ignoring variations in the structure and wording of different genres of documents. Some examples of different document genres include newspaper articles, editorials, reference manuals, scientific works and tutorials. One problem with existing approaches is that they can be slow and inaccurate when applied to heterogeneous document collections. A heterogeneous document collection includes documents of different genres, or document types, such as fiction, scientific or other non-fiction works, etc.
The present invention provides a system for genre-specific summarization of documents. The system of the present invention overcomes the problem of summarizing heterogeneous document collections by taking the genre, or type, of document into account when selecting summary sentences. We have discovered that one problem with applying known document summarization techniques to heterogeneous collections is that the assumptions made by such techniques may not apply across the population of the collection. Such assumptions include where in a document sentences which contain summary information might be located, keywords which may indicate summary information, etc. By taking genre into account, the system of the present invention takes advantage of the structure and wording of various document genres to provide faster and more accurate summaries. For example, document genres such as newspaper articles tend to have good summary sentences in the beginning and document genres such as research papers tend to have good summary sentences in the conclusion. The system of the present invention takes this information into account when selecting summary sentences.
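The positional example above can be illustrated with a minimal sketch. It is not the implementation of the invention: the function name `select_summary_sentences`, the genre labels, the `GENRE_POSITION` table, and the fallback for unknown genres are all hypothetical, and the sketch assumes the genre has already been determined and the document already split into sentences. Only the two tendencies (newspaper articles favor the beginning, research papers favor the conclusion) come from the text.

```python
from typing import List

# Hypothetical table: where good summary sentences tend to appear per genre.
# Only these two tendencies are taken from the text; the labels are assumed.
GENRE_POSITION = {
    "newspaper_article": "beginning",
    "research_paper": "end",
}


def select_summary_sentences(sentences: List[str], genre: str, n: int = 2) -> List[str]:
    """Return up to n sentences from the region the given genre favors."""
    if n <= 0:
        return []
    # Assumption: unknown genres fall back to the beginning of the document.
    position = GENRE_POSITION.get(genre, "beginning")
    if position == "end":
        return sentences[-n:]
    return sentences[:n]


if __name__ == "__main__":
    doc = ["s1", "s2", "s3", "s4", "s5"]
    print(select_summary_sentences(doc, "newspaper_article"))
    print(select_summary_sentences(doc, "research_paper"))
```

A lookup table keeps the genre-specific assumption in one place, so other cues mentioned above, such as genre-specific keywords, could be added per genre without changing the selection function's interface.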