1. Field of the Invention
The present invention generally relates to a method and apparatus for providing document summaries and, more particularly, to a method and apparatus for providing summaries of documents belonging to a class in a classified document collection.
2. Background Description
Businesses and institutions generate countless documents in the course of their commerce and activities. These documents range from business proposals and plans to intra-office correspondence between employees and the like.
The documents of a business or institution represent a substantial resource for that business or institution. Thus, to store these documents more effectively, it is not uncommon for the business or institution to store them digitally on a magnetic disc or other appropriate media.
One known method for electronically storing the documents is to first scan the documents, and then process the scanned images with optical character recognition software to generate machine-readable files. The generated machine-readable files are then compactly stored on magnetic or optical media. Documents originally generated by a computer, such as with word processing, spreadsheet or database software, can of course be stored directly to magnetic or optical media.
There is a significant advantage, from a storage and archival standpoint, to storing documents electronically, but there remains the problem of retrieving information from the stored documents. In the past, retrieval of the documents has been accomplished by separately preparing an index to access the documents. To this end, a number of full-text search software products have been developed which respond to structured queries to search a document database.
To further aid searching, it is not uncommon for retrieval systems to prepare summaries of stored documents so that a user need only read through the document summaries in order to find relevant documents. The use of such summary retrieval systems thus greatly reduces the time required to review the stored documents and, in turn, the costs associated with searching and reviewing them.
Document summaries can be generated after document creation either manually or automatically. Of course, manually created summaries are of high quality, but are cost-prohibitive because reading and summarizing the documents by hand is labor-intensive. On the other hand, automatic summaries are less expensive, but current systems do not produce consistently high quality document summaries.
A common approach for automatically generating summaries of individual documents relies upon either natural language processing or quantitative content analysis. Natural language processing is computationally intensive, while quantitative content analysis relies upon statistical properties of text to produce summaries. In both cases (i.e., natural language processing and quantitative content analysis), a document is typically processed in isolation to determine important words, phrases or terms, and those words, phrases or terms are then used to provide a summary of that particular processed document. Thus, in order to provide summaries for individual documents, each document is first separately processed to determine the important words, phrases or terms therein, and is thereafter further processed to match those important words, phrases or terms to provide a summary thereof. As is well understood by one of ordinary skill in the art, this type of approach is resource-inefficient and time-consuming.
By way of example, U.S. Pat. No. 5,689,716 to Chen discloses an automatic method of generating thematic summaries of a single document. The Chen technique begins by determining the number of thematic terms to be used, based upon the number of thematic sentences to be extracted from the document. The Chen method then identifies the thematic terms within the document, and afterward, each sentence of the document is scored based upon the number of thematic terms contained within the sentence. The desired number of highest scoring sentences is then selected as the thematic sentences. This same process must be repeated for each additional document.
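The general shape of this kind of sentence scoring can be illustrated with a minimal Python sketch. It is not the method claimed in the Chen patent: the function names, the small stop list, the use of raw word frequency to pick thematic terms, the rule tying the number of terms to the number of sentences (`terms_per_sentence`), and the choice to count distinct terms per sentence are all assumptions made purely for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not taken from the Chen patent).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "on", "for", "it", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]


def thematic_summary(text, num_sentences=2, terms_per_sentence=2):
    """Return the highest scoring sentences, in their original order."""
    sentences = split_sentences(text)

    # Assumed rule: the number of thematic terms scales with the number
    # of sentences requested.
    num_terms = num_sentences * terms_per_sentence

    # Assumed selection: the most frequent non-stop words are the thematic terms.
    counts = Counter(w for s in sentences for w in tokenize(s))
    thematic_terms = {term for term, _ in counts.most_common(num_terms)}

    # Score each sentence by how many distinct thematic terms it contains.
    scored = [(len(thematic_terms & set(tokenize(s))), i) for i, s in enumerate(sentences)]

    # Keep the top scorers (earlier sentences win ties), then restore document order.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:num_sentences]
    return [sentences[i] for _, i in sorted(top, key=lambda pair: pair[1])]
```

Note that every step operates on one document in isolation, so the whole procedure has to be run again for each further document, which is the inefficiency this background describes.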
A variant of the Chen method is disclosed in U.S. Pat. No. 5,384,703 to Withgott, et al. Withgott uses regions instead of sentences, and more specifically, discloses a method and apparatus for summarizing documents according to theme. By using the method and apparatus of Withgott, a summary of a document is formed by selecting regions of the document, where each selected region includes at least two members of a seed list. The seed list is formed from a predetermined number of the most frequently occurring complex expressions in the document that are not on a stop list. If the summary is too long, the region-selection process is performed on the summary to produce a shorter summary. This region-selection process is repeated until a summary of that particular document having a desired length is produced. Each time the region-selection process is repeated, the seed list members are added to the stop list and the complexity level used to identify frequently occurring expressions is reduced. As with Chen, this same process must be repeated for each additional document.
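The iterative region selection described above can be sketched in Python as follows. This is a loose illustration under stated assumptions, not the Withgott apparatus: regions are taken to be plain strings, a "complex expression" is modeled as a word n-gram whose length `complexity` stands in for the complexity level, and all function and parameter names are hypothetical.

```python
import re
from collections import Counter


def expressions(region, n, stop_list):
    """Word n-grams of a region that are not on the stop list.

    Modeling a 'complex expression' as an n-gram is an assumption.
    """
    words = re.findall(r"[a-z]+", region.lower())
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return [g for g in grams if g not in stop_list]


def summarize_regions(regions, max_regions, seed_size=3, complexity=2, stop_list=()):
    """Repeat region selection until the summary is short enough.

    Stops early if the complexity level is exhausted or no region qualifies.
    """
    stop = set(stop_list)
    summary = list(regions)

    while len(summary) > max_regions and complexity >= 1:
        # Seed list: the most frequent expressions not on the stop list.
        counts = Counter(e for r in summary for e in expressions(r, complexity, stop))
        seeds = {e for e, _ in counts.most_common(seed_size)}

        # Keep only regions containing at least two seed list members.
        selected = [
            r for r in summary
            if len(seeds & set(expressions(r, complexity, stop))) >= 2
        ]
        if selected:
            summary = selected

        # Before the next pass: seeds join the stop list, complexity is reduced.
        stop |= seeds
        complexity -= 1

    return summary
```

Because the complexity level drops on every pass, the loop always terminates, although in this sketch it may return a summary longer than `max_regions` if no further reduction is possible.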
An approach used for providing a single summary for an entire collection of documents is disclosed in “Generating Natural Language Summaries from Multiple On-Line Sources,” Dragomir Radev et al., Computational Linguistics, vol. 99, Nov. 9, 1998. In the Radev approach, linguistic analysis of a document collection includes filling predefined templates or information structures, and then using natural language generation techniques to provide a readable version of the formatted template.
Accordingly, what is needed is a method and system capable of providing a summary of individual documents without having to perform a resource-intensive process on each individual document. What is further needed is a method and system capable of providing a summary of more than one document belonging to a class in a classified document collection.