1. Field of the Invention
The present invention relates generally to an improved document processing system and, in particular, to a computer-implemented method, document processing system, and computer program product for identifying the common syntactic and semantic structures across a large number of formatted text documents. More specifically, structural properties of pieces of text from a collection of documents of similar type are automatically learned, so that syntactic property rules can be applied to identify how information from multiple documents can be merged into a corpus satisfying the concepts and relationships that have been identified, including the possibility of discovering or re-discovering one or more templates from the collection.
2. Description of the Related Art
While there has been prior work in the area of information extraction from semi-structured content, the techniques disclosed in the present invention differ in that they combine document structure and text styling to advantage.
Further, the current invention addresses situations where a common document template has been issued and subsequently followed by individual authors, who try to supply semantically consistent text content for the pre-designated segments of the template. In view of these situations, an exemplary objective of the present invention is to better reconstruct the original document template, while still allowing the method to be robust to minor variations on, omissions from, or additions to the original.
In addition, the current invention discovers when more than one template was used to create a document collection, and identifies what the original templates are likely to be. It then classifies each document according to the template it most likely followed. A collection containing multiple templates can arise when poor document management mixes documents originating from different sources. Very often the file names are not sufficiently descriptive to separate the documents again. To process such mixed collections, the current invention may be applied to separate the documents first, before extracting the textual content within.
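By way of illustration only, the separation of a mixed collection described above can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of the present invention: each document is assumed to be already reduced to a list of heading strings, documents are grouped greedily by Jaccard similarity of their heading sets, and a candidate template is taken to be the headings shared by a majority of a group. The function names, the 0.5 threshold, and the sample file names are all hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_by_template(docs, threshold=0.5):
    """Greedily group documents whose heading sets overlap strongly.

    docs: dict mapping a document name to its list of heading strings.
    threshold: assumed minimum similarity for joining an existing group.
    Returns a list of groups, each a list of document names.
    """
    groups = []  # each entry: (heading set of first member, [doc names])
    for name, headings in docs.items():
        heads = set(headings)
        best, best_sim = None, threshold
        for group in groups:
            sim = jaccard(heads, group[0])
            if sim >= best_sim:
                best, best_sim = group, sim
        if best is None:
            groups.append((heads, [name]))  # start a new candidate template
        else:
            best[1].append(name)
    return [names for _, names in groups]


def infer_template(docs, names):
    """Keep headings that appear in more than half of the group's documents.

    The result is sorted alphabetically; original heading order is not kept.
    """
    counts = Counter(h for n in names for h in set(docs[n]))
    return sorted(h for h, c in counts.items() if c > len(names) / 2)


# Hypothetical mixed collection drawn from two different templates.
docs = {
    "a.txt": ["Summary", "Scope", "Risks", "Budget"],
    "b.txt": ["Summary", "Scope", "Risks"],
    "c.txt": ["Abstract", "Methods", "Results"],
    "d.txt": ["Abstract", "Methods", "Results", "Appendix"],
}
groups = group_by_template(docs)
templates = [infer_template(docs, g) for g in groups]
print(groups)
print(templates)
```

The majority rule in `infer_template` is what gives the sketch some tolerance to omissions and additions by individual authors: a heading that only one author added is dropped from the candidate template, while a heading that one author omitted is still retained.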
Prior art references discovered during preparation of the discussion herein and considered as possibly relevant to the present invention are briefly described below:
U.S. Pat. No. 6,651,058 to Sundaresan, et al. (Neelakantan Sundaresan, Jeonghee Yi) presented a method to extract concepts and relationships in HTML documents, mainly based on text term frequencies without leveraging document structures.
U.S. Pat. No. 5,799,268 to Boguraev (Branimir K. Boguraev) presented a method to automatically create a help database or index of important terms through linguistic analysis. The method uses some limited syntactic or styling features, such as headings, to identify key terms in the document. There is no attempt to recover a document template.
US Patent Application Publication No. 2006/0026203 to Tan, et al. (Ah Hwee Tan, Rajaraman Kanagasabai) focused on identifying key concepts and relationships from documents using linguistic properties such as noun-verb-noun. It also takes as input a domain database, which is not a requirement in the present invention.
U.S. Pat. No. 7,149,347 to Wnek (Janusz Wnek) presented a method to train on and classify paper documents scanned using optical character recognition technology. A set of training data is required to enable Wnek's invention.
U.S. Pat. No. 6,604,099 to Chung, et al. (Christina Yip Chung, Neelakantan Sundaresan) presented a method to discover structures from ordered trees extracted from HTML documents by tracking the position of various keywords in the trees. Their invention is limited by the fact that the set of keywords has to be provided as input by the user and is not automatically learned from the styling hints in the documents. Moreover, the method is not applicable to flat document structures, which cannot be expressed as ordered trees.
US Patent Application Publication No. 2006/0288275 to Chidlovskii, et al. (Boris Chidlovskii, Jerome Fuselier) presented a method to classify semi-structured documents via ordered trees. They apply a Naïve Bayesian classifier to structural features of ordered trees to extract concepts from semi-structured data. However, the method neither takes advantage of text styling information nor applies to flat document structures, which cannot be expressed as ordered trees.
In contrast to the above-described methods, the present invention presents a different approach based on discovering the segmentation scheme and record scheme attributes so that, for example, an original template or templates can be rediscovered.