1. Field of the Invention
The present invention relates generally to summarization of document contents, and more particularly to a system and method of summarizing the content of text documents using relevance measurement technologies and latent semantic analysis techniques.
2. Description of the Related Art
The explosive growth of the World-Wide Web has dramatically increased the speed and the scale of information dissemination. With a vast sea of accessible text documents now available on the Internet, conventional Information Retrieval (IR) technologies have become increasingly inadequate for finding relevant information effectively. It has recently become quite common for a keyword-based search on the Internet to return hundreds (or even thousands) of hits, which often overwhelm the user. There is an increasing need for new technologies which assist users in sifting through vast volumes of information, and which can quickly identify the most relevant documents.
Given a large volume of text documents, presenting the user with summaries of these documents greatly facilitates the task of finding documents containing desired information. Text search and text summarization are two essential technologies that complement each other. Conventional text search engines return a set of documents based upon a relevance measurement with respect to a keyword query, for example; text summarization systems may then produce document summaries that facilitate a quick examination of the contents of each text document returned by the search (by providing, for example, an overview, keyword summary, or abstract).
In other words, a text search engine may typically serve as an information filter for identifying an initial set of relevant documents, while a cooperating text summarization system may serve as an information spotter for assisting the user in identifying a final set of desired or relevant documents.
There are two types of text summaries: generic summaries, and query-relevant summaries. Generic summaries provide an overall sense of a particular document's content, while query-relevant summaries present only content from a particular document that is closely related to the initial search query.
A good generic summary should contain the main topics presented in a document while minimizing redundancy. Since the generic summarization process is not responsive to a particular keyword query or topic search, developing a high quality generic summarization method and system has proven very challenging. A query-relevant summary, on the other hand, presents document contents that are specifically related to an initial search query; in many existing systems, creating a query-relevant summary is essentially a process of retrieving query-relevant sentences from the document. It will be appreciated by those of skill in the art that this process is strongly related to the text retrieval process. Accordingly, query-relevant summarization is most often achieved simply by extending conventional IR technologies.
Many text summarization methods have been proposed, and much recent research has been directed toward query-relevant text summarization. For example, B. Baldwin and T. S. Morton have proposed a query-sensitive summarization method that selects sentences from a document until all the phrases in the query are represented. A sentence in the document is considered to represent a phrase in the query if the sentence and the phrase “co-refer” to the same individual, organization, event, and so forth (B. Baldwin et al., Dynamic Co-reference-Based Summarization, in Proceedings of the Third Conference on Empirical Methods in Natural Language Processing (EMNLP3), Granada, Spain, June 1998). R. Barzilay and M. Elhadad have developed a method that creates text summaries by finding lexical chains in documents (R. Barzilay et al., Using Lexical Chains For Text Summarization, in Proceedings of the Workshop on Intelligent Scalable Text Summarization (Madrid, Spain), August 1997).
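By way of illustration only, the sentence selection loop attributed above to Baldwin and Morton may be sketched as follows. The sketch is not their implementation: the co-reference test is replaced by a simple case-insensitive substring match (an assumption made purely for brevity), and the function name select_covering_sentences is hypothetical.

```python
def select_covering_sentences(sentences, query_phrases):
    """Pick sentences in document order until every query phrase is represented.

    Assumption: a sentence "represents" a phrase if it contains the phrase
    (case-insensitive). The published method uses co-reference resolution.
    """
    uncovered = {phrase.lower() for phrase in query_phrases}
    summary = []
    for sentence in sentences:
        if not uncovered:
            break  # every query phrase is already represented
        text = sentence.lower()
        hits = {phrase for phrase in uncovered if phrase in text}
        if hits:
            summary.append(sentence)
            uncovered -= hits
    # Return the leftover phrases so a caller can see what was never matched.
    return summary, uncovered
```

Sentences that represent no outstanding query phrase are skipped, so the resulting summary grows only while some part of the query remains unrepresented.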
Mark Sanderson has approached the problem by dividing each document into equally sized overlapping passages, and using the INQUERY IR system to retrieve the passage from each document that best matches a query. This “best passage” is then used as a summary of the document. A query expansion technique called Local Context Analysis (LCA, which is also from INQUERY) is used before the best passage retrieval. Given a topic and a document collection, the LCA procedure retrieves top-ranked documents from the collection and examines the context surrounding the topic terms in each retrieved document; LCA then selects the words or phrases that are frequent in these contexts and adds these words or phrases to the original query (M. Sanderson, Accurate User Directed Summarization From Existing Tools, in Proceedings of the 7th International Conference on Information and Knowledge Management (CIKM98), 1998).
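The passage-based approach described above may be illustrated by the following minimal sketch. It assumes whitespace tokenization and a plain count of query-term occurrences as the matching score; the INQUERY ranking function and the LCA query expansion step are not reproduced, and the names overlapping_passages and best_passage are hypothetical.

```python
import string


def overlapping_passages(words, size, step):
    """Split a word list into equally sized windows that overlap when step < size."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    last_start = max(len(words) - size, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the end of the document is covered
    return [words[s:s + size] for s in starts]


def best_passage(document, query, size=6, step=3):
    """Return the passage containing the most query-term occurrences.

    Assumption: the score is a raw term count, a stand-in for a real IR
    ranking function. Ties are resolved in favor of the earliest passage.
    """
    terms = {t.lower().strip(string.punctuation) for t in query.split()}

    def score(passage):
        return sum(1 for w in passage if w.lower().strip(string.punctuation) in terms)

    passages = overlapping_passages(document.split(), size, step)
    return " ".join(max(passages, key=score))
```

Because adjacent windows share words, a query match that straddles a window boundary still falls wholly inside at least one passage, which is the practical reason for making the passages overlap.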
The SUMMARIST text summarizer from the University of Southern California attempts to create text summaries based on the equation:
summarization = topic identification + interpretation + generation
The identification stage filters the input document to determine the most important central topics. The interpretation stage clusters words and abstracts them into some encompassing concepts. Finally, the generation stage generates summaries either by outputting some portions of the input, or by creating new sentences based on the interpretation of the document concepts (E. Hovy et al., Automated Text Summarization in Summarist, in Proceedings of the TIPSTER Workshop, Baltimore, Md., 1998). This generation function was not realized in the work upon which this paper was based.
The Knowledge Management (KM) system from SRA International, Inc. extracts summarization features using morphological analysis, name tagging, and co-reference resolution. The KM approach uses a machine-learning technique to determine the optimal combination of these features and applies it, together with statistical information from the corpus, to identify the best sentences to include in a summary (http://www.SRA.com). The Cornell/Sabir system uses the document ranking and passage retrieval capabilities of the SMART text search engine to identify relevant passages in a document (C. Buckley et al., The SMART/Empire TIPSTER IR System, in Proceedings of TIPSTER Phase III Workshop, 1999). The text summarizer from CGI/CMU uses a technique called Maximal Marginal Relevance (MMR), which measures the relevance of each sentence in a document both relative to a query and relative to sentences that have already been added to the summary. The MMR system then produces summaries of a document by identifying key relevant, non-redundant information found within the document (J. Goldstein et al., Summarizing Text Documents: Sentence Selection and Evaluation Metrics, in Proceedings of ACM SIGIR'99, Berkeley, Calif., August 1999).
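The MMR selection principle described above may be sketched as follows, by way of example only. Each candidate sentence is scored as lam times its similarity to the query, minus (1 - lam) times its greatest similarity to any sentence already selected. The use of Jaccard overlap of word sets as the similarity measure, the default parameter values, and the name mmr_summary are assumptions made for illustration and are not taken from the CGI/CMU system.

```python
import re


def _tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _jaccard(a, b):
    """Jaccard overlap of two token sets (assumed similarity measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def mmr_summary(sentences, query, k=2, lam=0.7):
    """Greedily select up to k sentences by Maximal Marginal Relevance."""
    query_tokens = _tokens(query)
    token_sets = [_tokens(s) for s in sentences]
    selected = []                              # indices chosen so far, in order
    candidates = list(range(len(sentences)))

    while candidates and len(selected) < k:
        def score(i):
            relevance = _jaccard(token_sets[i], query_tokens)
            # Redundancy is the highest similarity to any sentence already chosen.
            redundancy = max(
                (_jaccard(token_sets[i], token_sets[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return [sentences[i] for i in selected]
```

With lam set to 1.0 the redundancy term vanishes and the procedure reduces to ranking sentences by query relevance alone; lower values increasingly penalize sentences that repeat material already present in the summary.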
Query-relevant text summaries such as those mentioned above may be useful for determining whether a given document is relevant to a user's query, and, if a document is relevant, for identifying which part of the document is related to the query. Since query-relevant summaries are created responsive to particular queries, however, these types of summaries do not provide an overall sense of the document content; consequently, query-relevant summaries are not appropriate for content overview. Generic text summarization techniques must be developed for identifying key topics within documents and for categorizing those documents.