The desirability of generating a summary of a document, such as an abstract, is well known. A more difficult task, yet equally desirable, is that of providing a summary of multiple documents in a document collection which are directed to a common event, person, theme and the like. Generally, such a collection of documents can span numerous sources and can range in time and focus. The ability to generate a readable summary which conveys the content of the document collection is important to enable researchers to determine whether the collection of documents pertains to the research question at hand.
A number of methods for generating a summary of multiple related documents have been considered. For example, the MultiGen system developed by Barzilay et al. and available from Columbia University, Department of Computer Science, New York, N.Y., is a known system which performs well at generating summaries of a set of documents which are closely related, such as documents concerning a single event. While the performance of the MultiGen system is suitable for use with documents which are closely related, this system was not intended to generate summaries for document collections which are less closely related, such as collections of documents addressing multiple events or issues, or collections of biographical documents. Documents in these more diverse collections present additional challenges for generating readable, meaningful summaries.
One important application for multidocument summarization is in the area of summarizing news stories published by multiple sources. In this regard, it would be useful to have a system which could gather news stories from a number of sources, group these stories into clusters of related documents and then generate a readable summary of the documents in each cluster. Such a system would enable a user to browse large quantities of content quickly and efficiently.