Documents obtained via an electronic medium (e.g., the Internet or on-line services such as AOL, CompuServe or other services) are often provided in such volume that it is important to be able to summarize them. Often, it is desirable to obtain quickly a brief summary of a document (e.g., a few sentences or a paragraph in length) rather than reading it in its entirety. Most typically, such documents span several paragraphs to several pages in length. This invention concerns itself with this kind of document, hereinafter referred to as an average length document. Summarization of document content is clearly useful for assessing the contents of items such as news articles and press releases, where little a priori knowledge is available concerning what a document might be about; a summarization or abstraction facility is even more essential in the framework of emerging "push" technologies, where a user might have very little control over what documents arrive at the desktop for his/her attention.
Conventional summarization techniques for average length documents fall into two broad categories: those that rely on template instantiation and those that rely on passage extraction.
Template Instantiation
A template is best thought of as a set of predefined categories for a particular domain. A template instantiation technique for content summarization seeks to instantiate these categories with values obtained from the body of a document, assuming that the document fits the expected domain. Such techniques are used for documents that can be conveniently assigned to a well-defined domain and are known to belong to it. Examples of such constrained domains are news stories about terrorist attacks, or about corporate mergers and acquisitions in the micro-electronics domain.
Template instantiation systems are specially designed to search for and identify predefined features in text. Restricting documents to a domain whose characteristic features are known ahead of time allows a program to identify specific aspects of the story, such as "who was attacked", "who was the perpetrator", "was the acquisition friendly or hostile", and so forth. A coherent summary can then be constructed by "fitting" the facts into a template. Unfortunately, these systems are by design limited to the particular subject domains they were engineered to cover because the systems, in effect, search for particular words and word patterns, and can function only if those patterns are present and can be mapped onto the domain categories (see Ralph Grishman, "Information Extraction: Techniques and Challenges", in M. T. Pazienza (Ed.), "Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology", Springer, 1997, and references therein).
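By way of illustration only, the slot-filling behavior described above can be sketched as follows. This is a minimal sketch and not a description of any particular prior art system: the template name, the slot names, the regular expressions and the two sample sentences are all hypothetical and were chosen solely to show how hand-written patterns fill predefined categories, and why nothing is filled when a document falls outside the expected domain.

```python
import re

# Hypothetical template for a "corporate acquisition" domain. The slot names
# and the hand-written patterns are assumptions made for illustration only.
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"
VERB = r"(?:acquired|will acquire|agreed to acquire)"
ACQUISITION_TEMPLATE = {
    "acquirer": re.compile(rf"(?P<value>{NAME}) {VERB}"),
    "target": re.compile(rf"{VERB} (?P<value>{NAME})"),
    "nature": re.compile(r"\b(?P<value>friendly|hostile)\b"),
}


def instantiate(template, text):
    """Fill each slot with the first matching value in text, or None."""
    filled = {}
    for slot, pattern in template.items():
        match = pattern.search(text)
        filled[slot] = match.group("value") if match else None
    return filled


# Invented sample sentences: one inside the expected domain, one outside it.
in_domain = "In a hostile bid, Acme Devices agreed to acquire Borel Circuits for cash."
out_of_domain = "Heavy rain flooded several villages along the river on Tuesday."

if __name__ == "__main__":
    print(instantiate(ACQUISITION_TEMPLATE, in_domain))
    print(instantiate(ACQUISITION_TEMPLATE, out_of_domain))
```

In this sketch the in-domain sentence fills all three slots, whereas the out-of-domain sentence leaves every slot empty, which mirrors the domain limitation noted above.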
Sometimes, a set of proper names and technical terms can be quite indicative of content. Phrasal matching techniques, developed for the purposes of template instantiation, are able to provide a list of the pertinent terms within a document. Such techniques have grown to become quite robust (see J. S. Justeson and S. M. Katz, "Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text", Journal of Natural Language Engineering, vol.1(1), 1995; see also "Coping with Unknown Lexicalizations", in B. K. Boguraev and J. Pustejovsky (Eds.), "Corpus Processing for Lexical Acquisition", MIT Press, 1996). If a document is small enough, complete lists of proper names and technical terms can provide a relatively informative characterization of the document content. However, for longer documents the term lists will be plagued by unnecessary and incorrect terms, ultimately defeating their representativeness as content abstractions.
Accordingly, this type of summarization technique requires a front-end analysis that is sensitive to a domain description and capable of filling out domain-specific templates that provide an accurate summarization of the document; it thus depends on knowing, a priori, the document's domain.
Passage Extraction
Passage extraction techniques do not depend on prior knowledge of the domain. They are based on identifying certain passages of text (typically sentences) as being most representative of the document. This type of technique typically uses a statistical approach to compute the "closeness" between a sentence and the document as a whole. Generally speaking, this closeness is determined by mapping individual sentences, as well as the entire document, onto a multidimensional vector space, and then performing mathematical calculations to determine how similar (by some appropriate metric) the sentence is to the text. A sentence containing many words that appear repeatedly throughout the document will therefore receive a relatively high score. The highest ranking sentence or sentences are then presented as a summary of the document.
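The scoring just described can be illustrated with a minimal sketch. It assumes raw term-frequency vectors and cosine similarity as the "appropriate metric"; actual systems may use other weightings or metrics. The function names and the four-sentence sample document are hypothetical and serve only to make the ranking concrete.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(sentences):
    """Return (score, sentence) pairs, most document-like sentence first."""
    doc_vector = Counter(tokenize(" ".join(sentences)))
    scored = [(cosine(Counter(tokenize(s)), doc_vector), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def summarize(sentences, k=1):
    """Present the k highest ranking sentences as the summary."""
    return [sentence for _, sentence in rank_sentences(sentences)[:k]]


# Invented sample document: three sentences on one topic, one on another.
sample = [
    "Chip merger approved.",
    "Chip merger creates largest chip maker.",
    "Regulators approved chip merger.",
    "Port strike continues.",
]

if __name__ == "__main__":
    for score, sentence in rank_sentences(sample):
        print(f"{score:.3f}  {sentence}")
```

With this sample, the sentences sharing the document's most frequent words rank first, and the lone sentence on the second topic ranks last, so a one-sentence summary would not mention that topic at all.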
Such "summarization" programs, some of which are beginning to be deployed commercially, do not provide true summaries, in the sense of a summary being, e.g., an abstract capturing the essential, core content of a document. While such a set of sentences is more indicative of what a document is about than, for instance, a title alone, it under-represents the topics and themes that may run through a document. A document may discuss several important topics. Unfortunately, in such documents, a small selection of sentences typically conveys the information relating to one topic but may fail to convey the existence of the other topics in the document.
Accordingly, what is needed is a system and method for analyzing documents to a finer grain of topic identification and content characterization than conventional techniques provide. In a preferred embodiment of the invention, the system and method should be able to analyze documents with multiple topics. The analysis would be used to produce summary-like abstractions of the documents. The system and method should be easy to implement and cost-effective. Furthermore, the content abstractions should contain relevant information from throughout the document, not just a selection of sentences that may miss significant topics. The present invention addresses these needs.