This invention relates to the field of digital libraries. More specifically, the invention relates to searching and describing documents in a digital library and/or over a network.
To utilize electronic documents efficiently, a technology which helps the user to understand the documents quickly is essential. In particular, there are many occasions where users need to look at more than one document at a time. An example is when the user gets query results from a search system.
Typically, the query result is returned in the format of a hit list. The hit list is a conventional method to visualize and summarize a set of documents resulting from a query. It contains the document titles and/or brief descriptions, which may be prepared by hand or automatically. It is generally sorted in the order of the documents' relevance to the query. Examples may be found at http://yahoo.com and http://altavista.com, on the World Wide Web.
Clustering techniques can be used to give the user an overview of a set of documents. A typical document clustering algorithm divides the documents into groups (clusters) so that the documents in a cluster are similar to one another and are less similar to documents in other clusters, based on some similarity measurement. Each cluster can have a cluster description, which is typically one or more words or phrases frequently used in the cluster.
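The grouping and labeling described above can be illustrated with a minimal sketch. It assumes a bag-of-words representation, cosine similarity as the similarity measurement, and a simple single-pass grouping rule with a fixed threshold; none of these choices comes from the cited art, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Single-pass grouping (an assumed rule): each document joins the most
    similar existing cluster, or starts a new one if none reaches threshold."""
    clusters = []  # each cluster: member indices plus summed term counts
    for i, doc in enumerate(docs):
        vec = Counter(doc.lower().split())
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["counts"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            best = {"members": [], "counts": Counter()}
            clusters.append(best)
        best["members"].append(i)
        best["counts"].update(vec)
    return clusters

def describe(c, k=2):
    """Cluster description: the k terms used most frequently in the cluster."""
    return [term for term, _ in c["counts"].most_common(k)]

# Hypothetical sample documents, for illustration only.
docs = [
    "stock market falls sharply",
    "stock market rallies",
    "heavy rain floods city",
    "rain floods roads",
]
clusters = cluster(docs)
for c in clusters:
    print(c["members"], describe(c))
```

With these sample documents the two finance items fall into one cluster and the two weather items into another, and each description lists the words shared within the cluster.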
Much work has been done on visualization of document collections, which intends to visualize the relationships among documents. This work is generally based on clustering or a proprietary space conversion technique focusing on a large number of documents, such as a whole corpus. An example of this approach is U.S. patent application Ser. No. 5,794,178, inventors Caid, et al., entitled “Visualization of information using graphical representations of context vector based relationships and attributes”, filed Aug. 11, 1998. It generates context vectors representing conceptual relationships among items, and further facilitates comprehension and use of textual information using visual and graphical representations.
The existing text summarization methods are mainly for single documents. They are generally based on sentence extraction or sentence generation. There is some research work on multiple document summarization. In general, the existing multiple document summarization methods assume a homogeneous set of documents as input, such as successive news stories on the same topic. These techniques either extract sentences in ways that avoid including duplicates or generate new sentences by merging similar ones.
Documents can be expressed as multidimensional data. Traditionally, factor analysis and multivariate analysis are used to analyze multidimensional data and detect the distinctive features of the data.
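As a minimal illustration of treating documents as multidimensional data, the sketch below maps each document to a vector of term counts over a shared vocabulary, so that each distinct word is one dimension. Whitespace tokenization and raw counts are assumptions made for brevity, and the function name and sample documents are hypothetical.

```python
from collections import Counter

def term_vectors(docs):
    """Map each document to a term-frequency vector over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        # One dimension per vocabulary term; absent terms count as 0.
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

# Hypothetical sample documents, for illustration only.
vocab, vectors = term_vectors(
    ["stock market falls", "stock prices rise", "rain falls"]
)
print(vocab)
for vec in vectors:
    print(vec)
```

Once documents are in this form, general-purpose techniques for multidimensional data can be applied to the rows.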
The conventional hit list shows the information for each document in a query result; to use it, the user has to go through the list one document at a time. The hit list shows the degree of relevance of each document to the query. However, it fails to show how the documents are related to one another.
A clustering program can be used to show which documents discuss similar topics. However, in general, a clustering program does not output explanations of each cluster (cluster labels) or, if it does, it still does not provide enough information for the user to understand the document set. For instance, U.S. patent application Ser. No. 5,857,179, inventors Vaithyanathan et al., entitled “Computer method and apparatus for clustering documents and automatic generation of cluster keywords”, filed Jan. 5, 1999, outputs cluster labels containing representative terms for each cluster. However, the labels alone do not provide enough information. Typically, clustering programs focus on grouping documents and fail to describe the main topic underlying each cluster.
When existing single document summarization methods are applied to a document set, they have the same problem as hit lists. That is, they fail to show how the documents are related to one another.
The typical existing multiple document summarization method assumes a homogeneous set of documents as input. It fails to show how multiple topics and multiple documents are related to each other in heterogeneous document sets.
Document visualization work generally focuses on a large number of documents to provide coarser-grained information. In general, the information provided by document visualization is too coarse-grained to be useful as a summarization for a smaller number (e.g. 50) of documents.
Factor analysis and multivariate analysis are intended to be used for general multidimensional data, and a human needs to interpret the analysis results. For the purpose of multiple document summarization, they fail to provide a human-comprehensible interpretation of those results.
U.S. patent application Ser. No. 5,937,422, inventors Nelson et al., entitled “Automatically generating a topic description for text and searching and sorting text by topic using the same”, filed Aug. 10, 1999, creates a word list as a topic description of the input text. It searches and sorts documents by comparing the similarity of the topic descriptions. However, it fails to summarize the relationships among multiple documents and multiple topics.
U.S. patent application Ser. No. 5,774,888, inventor Light, entitled “Method for characterizing a document set using evaluation surrogates,” filed Jun. 30, 1998, characterizes a document set using topic profiles. It requires predefined topic profiles and fails to characterize a document set containing undefined topics.
These cited references are herein incorporated by reference in their entirety.
An object of this invention is a system, method, and program for identifying and describing topics in a collection of electronic documents.
Another object of this invention is a system, method, and program for identifying and describing topics in a collection of electronic documents and describing topics of related electronic documents in a group.
This invention is a computer system, method, and program product for identifying and describing one or more topics in one or more documents in a document set. A term set process creates a basic term set from the document set, where the term set comprises one or more basic terms of one or more words in the document. A document vector process then creates a document vector for each document. The document vector has a document vector direction representing what the document is about. A topic vector process then creates one or more topic vectors from the document vectors. Each topic vector has a topic vector direction representing a topic in the document set. A topic term set process creates a topic term set for each topic vector that comprises one or more of the basic terms describing the topic represented by the topic vector. Each of the basic terms in the topic term set is associated with the relevancy of the basic term. A topic-document relevance process creates a topic-document relevance for each topic vector and each document vector. The topic-document relevance represents the relevance of the document to the topic. A topic sentence set process creates a topic sentence set for each topic vector that comprises one or more topic sentences that describe the topic represented by the topic vector. Each of the topic sentences is then associated with the relevance of the topic sentence to the topic represented by the topic vector.
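The sequence of processes summarized above can be sketched in a few lines. This is an illustration under stated assumptions and not the method of the invention: basic terms are taken to be single words found in at least two documents, document vectors are unit-length term counts, topic vectors are obtained by power iteration with deflation (a swapped-in technique, since this summary does not say how topic vectors are derived), and relevance is the dot product of unit vectors. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z]+", text.lower())

def unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def basic_terms(docs, min_docs=2):
    # Assumption: a basic term is one word found in at least min_docs documents.
    df = Counter(w for doc in docs for w in set(words(doc)))
    return sorted(w for w, n in df.items() if n >= min_docs)

def to_vector(text, terms):
    # Unit length, so only the direction of the vector carries information.
    counts = Counter(words(text))
    return unit([counts[t] for t in terms])

def topic_vectors(doc_vecs, k=2, iters=100):
    # Assumption: topics are dominant directions of the document vectors,
    # found by power iteration; each found topic is projected out (deflation).
    residuals = [list(v) for v in doc_vecs]
    topics = []
    for _ in range(k):
        v = unit([1.0] * len(residuals[0]))
        for _ in range(iters):
            v = unit([sum(dot(r, v) * r[j] for r in residuals)
                      for j in range(len(v))])
        topics.append(v)
        residuals = [[x - dot(r, v) * y for x, y in zip(r, v)]
                     for r in residuals]
    return topics

def topic_terms(topic, terms, n=2):
    # Basic terms paired with their weight along the topic direction.
    return sorted(zip(terms, topic), key=lambda p: -p[1])[:n]

def topic_sentences(docs, topic, terms, n=1):
    sentences = [s.strip() for doc in docs
                 for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    scored = [(s, dot(to_vector(s, terms), topic)) for s in sentences]
    return sorted(scored, key=lambda p: -p[1])[:n]

# Hypothetical sample documents, for illustration only.
docs = [
    "Stock prices fell. Investors sold stock.",
    "Stock prices rose as investors bought.",
    "Investors watch stock prices.",
    "Rain flooded roads. Rain continued.",
    "Rain flooded the city.",
]
terms = basic_terms(docs)
doc_vecs = [to_vector(d, terms) for d in docs]
topics = topic_vectors(doc_vecs)
# relevance[t][d]: relevance of document d to topic t.
relevance = [[dot(d, t) for d in doc_vecs] for t in topics]
for t in topics:
    print(topic_terms(t, terms), topic_sentences(docs, t, terms))
```

In this toy input the first topic vector points toward the finance documents and the second toward the weather documents, so each document scores high against one topic and near zero against the other. A real document set would need stop-word handling, multi-word terms, and a principled choice of the number of topics, all of which are omitted here.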