Automated analysis has become a popular tool for categorizing electronic documents (called “documents” herein). Typically, documents are analyzed through a variety of automated techniques, such as document clustering, summarization, and indexing. Such techniques help people determine, respectively, similarities between documents, a synopsis of a document or documents, and a way of navigating through multiple documents.
In particular, summarization of multiple documents can be helpful, for example, when browsing through search results or when editing or exploring a taxonomy (e.g., a classification of items based on similarities between the items, such as a set of hierarchically-organized documents). For instance, home repair may be divided into a number of similar topics, such as repairing electrical systems, replacing breakers, wiring new circuits, and replacing switches in preexisting circuits.
Some document analysis techniques use phrasal expressions, each typically comprising one or more words. For example, “nuclear power” is a phrasal expression that might be of some value for a certain document. This phrasal expression could then be used to summarize the document if, for instance, the phrasal expression occurs a predetermined number of times in the document. Additionally, if a collection of documents has the phrasal expression “nuclear power plant,” then this phrasal expression can be used in a summary of the collection.
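The threshold idea described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a description of any particular conventional system: the function names, the candidate phrase list, the whole-word case-insensitive matching, and the rule that a collection-level phrase must appear in every document are all hypothetical choices made for the example.

```python
import re


def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of a phrase in text."""
    # Allow any run of whitespace between the words of the phrase.
    words = [re.escape(word) for word in phrase.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def summary_phrases(text, candidates, min_count=2):
    """Return candidate phrases occurring at least min_count times in one document.

    min_count plays the role of the "predetermined number of times";
    its default value here is an arbitrary assumption.
    """
    return [p for p in candidates if count_phrase(text, p) >= min_count]


def collection_phrases(documents, candidates):
    """Return candidate phrases that occur at least once in every document.

    Requiring the phrase in every document is an assumption of this sketch.
    """
    return [
        p for p in candidates
        if all(count_phrase(doc, p) >= 1 for doc in documents)
    ]
```

For example, given a document in which “nuclear power” appears twice and “emissions” appears once, `summary_phrases` with `min_count=2` would keep only “nuclear power” as a summary phrase.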
Although document analysis is useful for distilling a summary or multiple summaries of a collection of documents, conventional document summarization techniques tend to become overburdened when the collection being summarized contains a large number of miscellaneous documents. Additionally, the generated summaries may not make sense relative to the collection of documents. For instance, the phrasal expressions “nuclear power” and “nuclear proliferation” might appear in the collection often enough to be used to summarize the collection, but a summarization of the collection may not indicate whether the two phrasal expressions are related. Therefore, a person attempting to use the summarization to navigate the collection may not be able to tell whether the two phrasal expressions are related in the collection.
Thus, there is a need to improve document summarization techniques.