The present invention relates generally to a system and method for reviewing documents. More particularly, the present invention relates to presentation of documents in a manner that allows the user to quickly ascertain their contents.
Documents obtained via an electronic medium (e.g., the Internet or on-line services, such as AOL, Compuserve or other services) are often provided in such volume that it is important to be able to summarize them. Oftentimes, it is desired to be able to quickly obtain a brief (i.e., a few sentences or a paragraph length) summary of the document rather than reading it in its entirety. Most typically, such documents span several paragraphs to several pages in length. The present invention concerns itself with this kind of document, hereinafter referred to as an average-length document.
Present-day summarization technologies fall short of delivering fully informative summaries of documents. To some extent, this is so because of shortcomings of the state of the art in natural language processing; in general, the issue of how to customize a summarization procedure for a specific information seeking task is still an open one. However, given the rapidly growing volume of document-based information on-line, the need for any kind of document abstraction mechanism is so great that summarization technologies are beginning to get deployed in real world situations.
The majority of techniques for "summarization", as applied to average-length documents, fall within two broad categories. One class of techniques mines a document for certain pre-specified pieces of information, typically defined a priori, on the basis of fixing the most characteristic features of a known domain of interest. Other approaches rely, in effect, on 're-using' certain fragments of the original text; these have been identified, typically by some similarity metric, as closest in meaning to the whole document. This categorization is not a rigid one: a number of approaches (as exhibited, for instance, in a recent workshop of the Association for Computational Linguistics, "Proceedings of a Workshop on Intelligent, Scalable, Text Summarization," Madrid, Spain, 1997) use strong notions of topicality (B. Boguraev and C. Kennedy, "Salience-based content characterization of text documents," in Proceedings of ACL '97 Workshop on Intelligent, Scalable Text Summarization, Madrid, Spain, 1997), (E. Hovy and C.Y. Lin, "Automated text summarization in SUMMARIST," in Proceedings of ACL '97 Workshop on Intelligent, Scalable Text Summarization, Madrid, Spain, 1997), lexical chains (R. Barzilay and M. Elhadad, "Using lexical chains for text summarization," in Proceedings of ACL '97 Workshop on Intelligent, Scalable Text Summarization, Madrid, Spain, 1997), and discourse structure (D. Marcu, "From discourse structures to text summaries," in Proceedings of ACL '97 Workshop on Intelligent, Scalable Text Summarization, Madrid, Spain, 1997), (U. Hahn and M. Strube, "Centered segmentation: scaling up the centering model to global discourse structure," in Proceedings of ACL-EACL/97, 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid, Spain, 1997), thus laying claim to newer sets of methods.
Still, at a certain level of abstraction, all approaches share a fundamental similarity: summarization methods today rely, in essence, on substantial data reduction over the original document source. Such a position leads to several usability questions.
Given the extracted fragments which any particular method has identified as worth preserving, what is an optimal way of encapsulating these into a coherent whole, for presenting to the user? Acknowledging that different information management tasks may require different kinds of summary, even from the same document, how should the data discarded by the reduction process be retained, in case a reference is necessary to a part of the document not originally included in the summary? What are the trade-offs in fixing the granularity of analysis: for instance, are sentences better than paragraphs as information bearing passages, or are phrases even better? Of particular importance to this invention is the question of "user involvement." From the end-user's point of view, making judgments, on the basis of a summary, concerning what a document is about and whether to pay it closer attention would engage the user in a sequence of actions: look at the summary, absorb its semantic impact, infer what the document might be about, decide whether to consult the source, somehow call up the full document, and navigate to the point(s) of interest. Given that this introduces a serious amount of cognitive and operational overhead, what are the implications for the user when they are faced with a large, and growing, number of documents to deal with on a daily basis?
These are only some of the questions concerning the acceptability of summarization technology by end users. There is particular urgency, given the currently evolving notion of "information push", where content arriving unsolicited, and in large quantities, at individual workstations threatens users with real and immediate information overload. To the extent that broad coverage summarization techniques are beginning to get deployed in real world situations, it is still the case that these techniques are based primarily on sentence extraction methods. In such a context, the above questions take on more specific interpretations. Thus, is it appropriate to concatenate together the sentences extracted as representative, especially when they come from disjoint parts of the source document? What could be done, within a sentence extraction framework, to ensure that all 'themes' in a document get represented by the set of sentences identified by the technology? How can the jarring effect of 'dangling' (and unresolved) references in the selection, without any obvious means of identifying the referents in the original text, be overcome? What mechanisms could be developed for offering the user additional information from the document, for more focused attention to detail? What is the value of the sentence, as a basic information-bearing unit, as a window into a multi-document space?
To illustrate some of these issues, consider several examples from an operational news tracking site: the News Channel page of Excite, an information vendor and a popular search engine host for the World Wide Web, available via the "Ongoing Coverage" section of the news tracking page (http://nt.excite.com). Under the heading of Articles about IRS Abuses Alleged, some entries read:
Example 1
RENO ON Sunday/Reform Taxes the . . .
The problem, of course, is that the enemies of the present system are all grinding different axes. How true, how true, and ditto for most of the people who sit on the Finance Committee. (First found: Oct. 18, 1997)
Example 2
Scheduled IRS Layoffs For 500 Are.
The Agency's original plan called for eliminating as many as 5,000 jobs in field offices and at the Washington headquarters. "The way this has turned out, it works to the agency's advantage, the employees' advantage and the union's advantage." (First found: Oct. 17, 1997.)
Both examples present summaries as sentences which almost seamlessly follow one another. While this may account for acceptable readability, it is at best misleading, as in the original documents these sentences are several paragraphs apart. This makes it hard to know that the referents of "How true, how true" in the first example, and "The way this has turned out" in the second, are not whatever might be mentioned in the preceding summary sentences, but are, in fact, hidden somewhere in the original text of the documents. Opening references to "The problem" and "the agency" are hard to resolve. The thrust of the second article, namely that there is a reversal of an anticipated situation, is not at all captured: it turns out that the missing paragraphs between the summary sentences discuss how the planned 5,000 layoffs have been reduced to "4,000, then 1,400 and finally settled at about 500", and that "now, even those 500 workers will not be cut". As it turns out, some indication to this effect might have been surmised from the full title of the article, Scheduled IRS Layoffs For 500 Are Canceled; unfortunately, this has been truncated by a data reduction strategy which is insensitive to notions of linguistic phrases, auxiliary verb constructions, mood, and so forth.
In the extreme case, such summaries can range from under-informative (as illustrated by the first example above), to misleading (the second example), to plainly devoid of any useful information. Another example from the same site reads:
Example 3
Technology News from Wired News
This is more than 500 times thinner than a human hair. "Don't expect one in a present under your Christmas tree this year."
Accordingly, a particular problem that must be addressed is how to "fill in the gaps" which the data reduction process necessarily introduces as a summary is constructed by choosing certain fragments from the original source. Presently, the known way of filling such gaps, assuming of course that they are even perceived, is through active user involvement: requesting the entire document.
Currently, there is a relatively rigid mechanism, typically sensitive to a mouse click or some similar interactive command, with the simple semantics of "bring up the entire document, possibly with the point of view focused on the particular sentence of the summary which received the click, presented in its natural document context, and maybe highlighted". Clearly, having a richer data structure would facilitate greater flexibility in interactions with what would be, in effect, a whole range of dynamically reconfigured summaries at different levels of granularity and detail.
There is still one problem, however: the process of filling in the gaps requires active user involvement. In principle there is nothing wrong with this. In practice, real information management environments involve working with a large number of documents. It is far from clear that users will have the energy, bandwidth, dedication, and concentration required to assess, absorb, and act upon summaries for each one of these documents, by clicking their way through each member of a long static list.
Accordingly, what is needed is a system and method for presenting a plurality of documents to a user in a more expeditious fashion than when utilizing conventional techniques. In a preferred embodiment, the system and method should be able to analyze documents with multiple topics. The analysis would typically be used to produce summary-like abstractions of the documents at varying levels of granularity and detail. The system and method should be easy to implement and cost-effective. Furthermore, the document presentation should contain relevant information from throughout the document, not just a selection of sentences that may miss significant topics. The system and method should allow the presentation to be sensitive to multilayer analysis, should be able to present salient and contextualized highlights of a document and should make the document available to the user seamlessly, by an active user interface. Finally, the presentation should be adaptable such that a user decides whether he/she desires to be actively involved in the presentation. The present invention addresses these needs.
A method and system for the dynamic presentation of the contents of a plurality of documents on a display are disclosed. The method and system comprise receiving a plurality of documents and providing a plurality of topically rich capsule overviews corresponding to the plurality of documents. The method and system also include dynamically delivering document content encapsulated in the plurality of capsule overviews.
In so doing, a system and method in accordance with the present invention can present thematic capsule overviews of the documents to users. A capsule overview is derived for the entire document, and depicts the core content of an average-length article in a more accurate and representative manner than conventional techniques do. The capsule overviews, delivered in a variety of dynamic presentation modes, allow the user to quickly get a sense of what a document is about, and decide whether they want to read it in more detail. If so, the system and method greatly facilitate the process of focused navigation into the parts of the document which may be of particular interest to the user.
In a preferred embodiment, the capsule overviews include a containment hierarchy which relates the different information levels in a document together, and which includes a collection of highly salient topic stamps embedded in layers of progressively richer and more informative contextualized text fragments.
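By way of illustration only, the containment hierarchy described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the class names `TopicStamp` and `CapsuleOverview`, the `at_level` and `render` methods, and the use of plain substring containment to relate layers are all hypothetical choices made for the example. The sample text is taken from the IRS layoffs article quoted earlier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicStamp:
    """A salient phrase with progressively wider context (hypothetical type)."""
    stamp: str
    # Ordered narrowest to widest; each layer must contain the previous one.
    contexts: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce containment: stamp within contexts[0] within contexts[1] ...
        inner = self.stamp
        for outer in self.contexts:
            if inner not in outer:
                raise ValueError(f"{inner!r} is not contained in {outer!r}")
            inner = outer

    def at_level(self, level: int) -> str:
        """Level 0 is the bare stamp; higher levels give wider context.

        Levels beyond the widest available layer are clamped to it.
        """
        if level <= 0 or not self.contexts:
            return self.stamp
        return self.contexts[min(level, len(self.contexts)) - 1]


@dataclass
class CapsuleOverview:
    """A document title plus its topic stamps (hypothetical type)."""
    title: str
    stamps: List[TopicStamp]

    def render(self, level: int = 0) -> List[str]:
        """Return one fragment per topic stamp at the requested detail level."""
        return [s.at_level(level) for s in self.stamps]


# Assumed example built from fragments of the article quoted in Example 2.
overview = CapsuleOverview(
    title="Scheduled IRS Layoffs For 500 Are Canceled",
    stamps=[
        TopicStamp(
            stamp="5,000 jobs",
            contexts=[
                "eliminating as many as 5,000 jobs",
                "The Agency's original plan called for eliminating as many "
                "as 5,000 jobs in field offices and at the Washington "
                "headquarters.",
            ],
        ),
        TopicStamp(
            stamp="500 workers",
            contexts=["even those 500 workers will not be cut"],
        ),
    ],
)
```

Calling `overview.render(0)` yields only the topic stamps, while higher levels return the enclosing fragments, so a presentation layer could move between granularities without going back to the full document.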
The novel presentation metaphors which the invention utilizes are based on notions of temporal typography, in particular for exploiting the interactions between form and content.