1. Field of the Invention
The present invention is directed toward the field of text summarization, and more particularly toward generating summaries of documents from structured text in those documents.
2. Art Background
In general, text summarization is the ability to provide an overview or synopsis of information from one or more sources. For purposes of nomenclature, a document is broadly defined herein as a source of information. A document may consist of a periodical, an article, a book, etc. One goal of systems that manage documents is to automate the generation of summaries for one or more documents.
Intelligent text summarization is among the largest problems facing the information management community. In the prior art, summarization of information deals almost exclusively with corpora from the news domain and provides summaries of a single news story. However, these techniques do not attempt to compare and contrast the news reported by more than one bureau. New summarization techniques are needed because the overwhelming amount of textual information on the Internet threatens to render the medium useless. The average request for a “match” on a single word using a search engine results in over 2,000 “hits”. Most of these “hits” are unrelated, outdated, or irrelevant to the match query. This problem of query precision is exacerbated when the user attempts to combine related textual information from multiple documents. For example, a user of an information management system may desire to compare analyst statements from multiple documents relating to a potential stock investment. Currently, processing this query requires a sequential scan of each of the analysts' statements and a subsequent manual compilation of the opinions identified. As another example, if the user wishes to monitor the change in issue position statements for a political candidate, then the user must sequentially scan the full text of this temporal textual information in order to reach a decision regarding the changing position. Accordingly, it is desirable to develop a system that permits a user to compare summarized information from multiple document sources.
Text structure contributes to the identification of classes of documents (e.g., business letters vs. journal articles vs. user manuals), parts of documents (e.g., the sports page vs. the classified advertisements in newspapers) and the types of information contained in a document (e.g., subsidiary information in footnotes vs. primary information in titles, paragraph breaks at topic breaks). Text structure clearly plays a role in text comprehension. Nevertheless, the use of text structure is largely ignored in the fields of computational linguistics and information retrieval. In the prior art, no information retrieval system or information extraction system makes more than cursory use of text structure. There is also no attempt to utilize the structure of text to add intelligence to the summarization process.
In general, text structure involves identifying, with a standardized language, various attributes of a document. For example, HTML documents, pervasive on the Internet, provide some information about text structure, including headings and paragraph breaks. However, the minimal amount of structural information provided in HTML documents does not supply the underlying structure needed to generate document summaries. The eXtensible mark-up language (“XML”) embeds structural information into a textual document. It is desirable to develop a text summarization system that utilizes structural information to generate summaries of documents. As is described fully below, the present invention utilizes text structure (i.e., structural information embedded into a document) to create hierarchical relationships using text type understanding so as to provide enhanced text summarization of documents.
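To illustrate what structural information embedded into a document can look like, the following is a minimal sketch in Python using the standard xml.etree.ElementTree module. The tag names (document, title, section, heading, para, footnote), the sample text, and the helper functions outline and footnotes are hypothetical and chosen only for illustration; they are not taken from this specification and do not represent the summarization method of the invention. The sketch only shows that primary information (titles, headings) and subsidiary information (footnotes) can be separated, with nesting depth, once the structure is marked up.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the tag names and text are illustrative assumptions.
SAMPLE = """<document>
  <title>Quarterly Outlook</title>
  <section>
    <heading>Analyst View</heading>
    <para>Shares are expected to rise.<footnote>Based on one bureau.</footnote></para>
  </section>
  <section>
    <heading>Risks</heading>
    <para>Demand may soften.</para>
  </section>
</document>"""

# Assumption: these tags carry the "primary" information of the document.
PRIMARY_TAGS = {"title", "heading"}


def outline(xml_text):
    """Return (depth, tag, text) for each primary element, in document order."""
    root = ET.fromstring(xml_text)
    result = []

    def walk(elem, depth):
        if elem.tag in PRIMARY_TAGS:
            result.append((depth, elem.tag, (elem.text or "").strip()))
        for child in elem:
            walk(child, depth + 1)

    walk(root, 0)
    return result


def footnotes(xml_text):
    """Return the text of each footnote (treated here as subsidiary information)."""
    root = ET.fromstring(xml_text)
    return [(f.text or "").strip() for f in root.iter("footnote")]


if __name__ == "__main__":
    for depth, tag, text in outline(SAMPLE):
        print("  " * depth + tag + ": " + text)
    print("footnotes:", footnotes(SAMPLE))
```

On the sample above, outline yields the title followed by the two section headings, each with its nesting depth, while footnotes yields the single footnote text. Comparable markup in plain HTML would expose headings and paragraph breaks but would not by itself distinguish a footnote from body text.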