Documents obtained via an electronic medium (e.g., the Internet or on-line services such as AOL, CompuServe, or other services) are often provided in such volume that it is important to be able to summarize them. Often, it is desirable to quickly obtain a brief summary of a document (i.e., a few sentences or a paragraph in length) rather than reading it in its entirety. Most typically, such documents span several paragraphs to several pages in length. The present invention concerns itself with this kind of document, hereinafter referred to as an average-length document.
Present-day summarization technologies fall short of delivering fully informative summaries of documents. To some extent, this is because of shortcomings of the state of the art in natural language processing; in general, the issue of how to customize a summarization procedure for a specific information-seeking task is still an open one. However, given the rapidly growing volume of document-based information on-line, the need for any kind of document abstraction mechanism is so great that summarization technologies are beginning to be deployed in real-world situations.
The majority of techniques for “summarization”, as applied to average-length documents, fall within two broad categories. One class of techniques mines a document for certain pre-specified pieces of information, typically defined a priori by fixing the most characteristic features of a known domain of interest. Other approaches rely, in effect, on ‘re-using’ certain fragments of the original text, namely those that have been identified, typically by some similarity metric, as closest in meaning to the whole document. This categorization is not a rigid one: a number of approaches (as exhibited, for instance, in a recent workshop of the Association for Computational Linguistics, “Proceedings of a Workshop on Intelligent, Scalable Text Summarization,” Madrid, Spain, 1997) use strong notions of topicality (B. Boguraev and C. Kennedy, “Salience-based content characterization of text documents,” in Proceedings of ACL '97 Workshop on Intelligent, Scalable Text Summarization, Madrid, Spain, 1997), (E. Hovy and C. Y. Lin, “Automated text summarization in SUMMARIST,” in Proceedings of ACL '97 Workshop on Intelligent, Scalable Text Summarization, Madrid, Spain, 1997), lexical chains (R. Barzilay and M. Elhadad, “Using lexical chains for text summarization,” in Proceedings of ACL '97 Workshop on Intelligent, Scalable Text Summarization, Madrid, Spain, 1997), and discourse structure (D. Marcu, “From discourse structures to text summaries,” in Proceedings of ACL '97 Workshop on Intelligent, Scalable Text Summarization, Madrid, Spain, 1997), (U. Hahn and M. Strube, “Centered segmentation: scaling up the centering model to global discourse structure,” in Proceedings of ACL-EACL/97, 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid, Spain, 1997), thus laying claim to newer sets of methods.
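By way of illustration only, the second category (re-use of fragments judged closest in meaning to the whole document) might be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any of the cited works: the similarity metric is assumed to be cosine similarity over raw term frequencies, sentences are split naively on terminal punctuation, and the names `tokenize`, `cosine`, and `extract_summary` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def extract_summary(document, k=2):
    """Return the k sentences most similar to the whole document, in source order.

    Assumption: similarity is plain term-frequency cosine; deployed systems
    may weight terms or use other metrics entirely.
    """
    # Naive sentence splitting on whitespace that follows '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    doc_vector = Counter(tokenize(document))
    # Score each sentence against the document as a whole.
    scored = [(cosine(Counter(tokenize(s)), doc_vector), i)
              for i, s in enumerate(sentences)]
    # Keep the k best scores (ties broken by earlier position).
    top = sorted(scored, key=lambda p: (-p[0], p[1]))[:k]
    # Restore source order before returning the extracted sentences.
    return [sentences[i] for _, i in sorted(top, key=lambda p: p[1])]
```

Note that the output of such a sketch is a simple concatenation of extracted sentences; everything not selected is discarded, and any pronoun or other reference in a selected sentence is carried over without its referent.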
Still, at a certain level of abstraction, all approaches share a fundamental similarity: summarization methods today rely, in essence, on substantial data reduction over the original document source. Such a position leads to several usability questions.
Given the extracted fragments which any particular method has identified as worth preserving, what is an optimal way of encapsulating these into a coherent whole, for presenting to the user? Acknowledging that different information management tasks may require different kinds of summary, even from the same document, how should the data discarded by the reduction process be retained, in case reference must be made to a part of the document not originally included in the summary? What are the trade-offs in fixing the granularity of analysis: for instance, are sentences better than paragraphs as information-bearing passages, or are phrases even better? Of particular importance to this invention is the question of “user involvement.” From the end-user's point of view, judging, on the basis of a summary, what a document is about and whether to pay it closer attention engages the user in a sequence of actions: look at the summary, absorb its semantic impact, infer what the document might be about, decide whether to consult the source, somehow call up the full document, and navigate to the point(s) of interest. Given that this introduces a serious amount of cognitive and operational overhead, what are the implications for users who are faced with a large, and growing, number of documents to deal with on a daily basis?
These are only some of the questions concerning the acceptability of summarization technology by end users. There is particular urgency, given the currently evolving notion of “information push”, where content arriving unsolicited, and in large quantities, at individual workstations threatens users with real and immediate information overload. To the extent that broad-coverage summarization techniques are beginning to be deployed in real-world situations, it is still the case that these techniques are based primarily on sentence extraction methods. In such a context, the above questions take on more specific interpretations. Thus, is it appropriate to concatenate the sentences extracted as representative, especially when they come from disjoint parts of the source document? What could be done, within a sentence extraction framework, to ensure that all ‘themes’ in a document are represented by the set of sentences identified by the technology? How can the jarring effect of ‘dangling’ (and unresolved) references in the selection, which offers no obvious means of identifying the referents in the original text, be overcome? What mechanisms could be developed for offering the user additional information from the document, for more focused attention to detail? What is the value of the sentence, as a basic information-bearing unit, as a window into a multi-document space?
To illustrate some of these issues, consider several examples from an operational news tracking site: the News Channel page of Excite, an information vendor and a popular search engine host for the World Wide Web, which is available via the “Ongoing Coverage” section of the news tracking page (http://nt.excite.com). Under the heading “Articles about IRS Abuses Alleged”, some entries read: