A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of any portion of the patent document, as it appears in any patent granted from the present application or in the Patent and Trademark Office file or records available to the public, but otherwise reserves all copyright rights whatsoever.
An appendix containing a source code listing utilized in practicing an exemplary embodiment of the invention is included as part of the Specification and is hereinafter referred to as Appendix A. Appendix A is found on pages 30-59 of the Specification.
The present invention relates in general to the field of natural language processing and automatic text analysis and summarization. More particularly, the present invention relates to a method and system for topical segmentation of a document and classification of segments according to segment function and importance.
Identification of a document's discourse structure can be extremely useful in natural language processing applications such as automatic text analysis and summarization and information retrieval. For example, simple segmentation of a document into blocks of topically similar text can be useful in assisting text search engines to determine whether or not to retrieve or highlight a particular segment in which a query term occurs. Similarly, topical segments can be useful in assisting summary agents to provide detailed summaries by topic in accordance with a segment function and/or importance. Topical segmentation is especially useful for accurately processing long texts having multiple topics for a wide range of natural language applications.
Conventional methods for topical segmentation, such as in Hearst's TextTiling program, identify zero or more segment boundaries at various paragraph separations, which in turn identify one or more topical text segments. See M. Hearst, “Multi-Paragraph Segmentation of Expository Text,” Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (1994). Topical segmentation is thus linear, but based solely upon the equal consideration of selected terms. Terms are regarded as equally important in deciding how to segment the document input, and as such segmentation does not leverage the differences between term types. TextTiling, in addition, makes no effort to measure the significance and function of identified topical segments.
Other conventional methods use hierarchical segmentation to create tree-like representations of a document's discourse structure. See U.S. Pat. No. 5,642,520; D. Marcu, “The Rhetorical Parsing of Natural Language Texts,” The Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics at pp. 96-103 (1997); Y. Yaari, “Segmentation of Expository Text by Hierarchical Agglomerative Clustering,” Recent Advances in NLP 1997. Bulgaria (1997). Hierarchical segmentation attempts to calculate not only topic boundaries, but also subtopic and sub-subtopic boundaries. This is inherently a more difficult task and can be prone to more sources of error. Researchers also define “topic” differently such that many times a topic boundary in one text can correspond to a subtopic or a supertopic in another segmentation program.
Still other conventional hierarchical schemes, for example, use complex “attentional” models or rules that look at the topic of discussion for a particular sentence; that is, the focus of the sentence. Attentional models are commonly used to determine pronominal resolution, e.g., what person does “he” or “she” refer to in the text, and usually require contextual knowledge that is often difficult to glean from the language input using automated methods. See U.S. Pat. No. 5,642,520.
Again, as with conventional linear segmentation schemes, no effort is made with conventional hierarchical schemes to determine the contextual significance or function of the identified topical segments.
The aforedescribed limitations and inadequacies of conventional topical segmentation methods are substantially overcome by the present invention, in which a primary object is to provide a method and system for segmenting text documents so as to efficiently and accurately identify topical segments of the documents.
It is another object of the present invention to provide a system and method that identifies the significance of identified topical segments.
It is yet another object of the present invention to provide a system and method that identifies the function of identified topical segments.
In accordance with a preferred method of the present invention, a method is provided that includes the steps of: extracting one or more selected terms from a document; linking occurrences of the extracted terms based upon the proximity of similar terms; and assigning weighted scores to paragraphs of the document input corresponding to the linked occurrences, wherein the scores depend upon the type of the selected terms and the position of the linked occurrences with respect to the paragraphs, and wherein the scores represent boundaries of the topical segments.
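By way of illustration only, the extracting, linking and scoring steps recited above can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the listing of Appendix A: the term types (proper noun, common noun, pronoun), the per-type link distances, the numeric weights, and the function names link_occurrences and score_paragraphs are all hypothetical and are chosen only to show how a score can depend on both the type of a term and the position of its linked occurrences relative to a paragraph. Term extraction is assumed to have been performed already, so that each paragraph is supplied as a list of (term, type) pairs.

```python
from collections import defaultdict

# Hypothetical per-type settings; the text above gives no numeric values.
MAX_GAP = {"proper_noun": 3, "common_noun": 2, "pronoun": 1}
START_WEIGHT = {"proper_noun": 3.0, "common_noun": 2.0, "pronoun": 1.0}
INSIDE_PENALTY = {"proper_noun": 1.0, "common_noun": 1.0, "pronoun": 0.5}


def link_occurrences(paragraphs):
    """Link occurrences of the same term that lie close together.

    `paragraphs` holds one list of (term, term_type) pairs per paragraph.
    Returns (term_type, first_paragraph, last_paragraph) tuples, one per link.
    """
    positions = defaultdict(list)
    for index, terms in enumerate(paragraphs):
        for term, term_type in terms:
            positions[(term, term_type)].append(index)

    links = []
    for (_term, term_type), indices in positions.items():
        start = previous = indices[0]
        for index in indices[1:]:
            # A gap wider than the allowed distance closes the current link.
            if index - previous > MAX_GAP[term_type]:
                links.append((term_type, start, previous))
                start = index
            previous = index
        links.append((term_type, start, previous))
    return links


def score_paragraphs(paragraphs):
    """Assign each paragraph a weighted score from the links that touch it.

    A paragraph where a link begins gains the start weight for that term
    type; a later paragraph covered by the same link loses the inside
    penalty. Higher scores suggest a boundary just before the paragraph.
    """
    scores = [0.0] * len(paragraphs)
    for term_type, start, end in link_occurrences(paragraphs):
        scores[start] += START_WEIGHT[term_type]
        for index in range(start + 1, end + 1):
            scores[index] -= INSIDE_PENALTY[term_type]
    return scores


# Illustrative input: five paragraphs with already-extracted, typed terms.
example = [
    [("court", "common_noun"), ("Smith", "proper_noun")],
    [("court", "common_noun")],
    [("Smith", "proper_noun")],
    [("market", "common_noun")],
    [("market", "common_noun"), ("it", "pronoun")],
]
print(score_paragraphs(example))
```

In this sketch the first and fourth paragraphs of the illustrative input receive the highest scores, because links begin there and no link from an earlier paragraph continues through them. Penalizing paragraphs that fall inside a link is one possible design choice for expressing the position of the linked occurrences; other weighting schemes would fit the recited steps equally well.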
In accordance with another preferred method of the present invention, a method is provided for automatically extracting significant topical information from a document, the method including the steps of: extracting topical information from a document in accordance with specified categories of information; linking occurrences of the extracted topical information based on the proximity of similar topical information; determining topical segments within the document corresponding to the linked occurrences of the topical information; and determining the significance of the topical segments.
In another aspect of the present invention, a computer program is provided for topical segmentation of a document's input. The computer program includes executable commands for: extracting selected terms from a document; linking occurrences of the extracted terms based upon the proximity of similar terms; and assigning weighted scores to paragraphs of the document input corresponding to the linked occurrences, wherein the scores depend upon the type of the selected terms and the position of the linked occurrences with respect to the paragraphs, and wherein the scores represent boundaries for the topical segments.
In yet another aspect of the present invention, a computer program is provided for automatically extracting significant topical information from a document. The computer program includes executable commands for: extracting topical information from a document in accordance with specified categories of information; linking occurrences of the extracted topical information based on the proximity of similar topical information; determining topical segments within the document corresponding to the linked occurrences of the topical information; and determining the significance of the topical segments.