1. Field of the Invention
The present inventions relate to software applications for automating the generation of extensible markup language (XML) structure from plain-text documents, rich-text documents and textual data records. The software provides for automated inference of XML structure, application of the corresponding XML markup to target documents and textual data records, and automated conversion of unstructured textual documents to XML.
2. Description of the Related Art
Many businesses migrating to XML-based IT solutions will face the problem of converting large volumes of legacy documents existing in various storage formats to XML. The conversion problem also arises in scenarios where XML is needed by back-end and workflow systems, but document authors are unwilling or unable to use a specialized XML authoring tool and typically prefer to work instead in a generic wordprocessor such as Microsoft Word. Transformation of unstructured content into XML is one of the most challenging tasks in many XML-oriented initiatives. In many multi-channel publishing environments, content conversion to XML is often a requirement. For such environments, there is a need for highly effective, fully customizable conversion of unstructured textual content to XML, without disrupting communication with authors and content contributors who are using ordinary wordprocessor documents.
There are currently a number of converter software packages available, most of them classifiable as RTF-to-XML converters. The basis for this generic classification is the common assumption that various textual document formats, such as Microsoft Word and Corel WordPerfect, can be easily converted to RTF first, with minimal or no loss of fidelity, and then a single, uniform method can be used for parsing the RTF data, analyzing text content and formatting, and producing XML output conforming to a predefined XML schema or DTD. Similarly, some solutions use HTML or a proprietary intermediate format. Known software converter packages include XDocs from CambridgeDocs (Charlestown, Mass.), VorteXML from Datawatch Corporation (Lowell, Mass.), ContentMaster from Itemfield (Israel), Logictran RTF Converter (Minnetonka, Minn.), X-ICE by Turnkey Systems (Sydney, Australia), upCast by Infinity Loop (Germany), YAWC (Ireland) and Omnimark from Stilo (Bristol, United Kingdom). Typically the basis for conversion in prior art systems is mapping styles and custom formatting to XML elements, sometimes using text patterns as well. Some converters, e.g., Omnimark, provide integration with a standard scripting language or define one of their own so that custom conversion rules and conditions can be expressed. It is worth noting that in most cases mapping of patterns to schema elements is done ad hoc, without relying on a schema-guided conversion model that takes into account the element nesting and validity constraints defined in the target schema. More esoteric or special-purpose conversion applications are known that employ statistical analysis (Bayesian probability), vector machines, or neural networks as a basis for more “intelligent” structure inference.
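By way of illustration only, the ad hoc style-mapping approach described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any of the named products: the style names, the element names, the `STYLE_MAP` table and the `convert` function are all hypothetical, and the input is assumed to have already been parsed into (style, text) pairs.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from wordprocessor paragraph styles to XML element names.
STYLE_MAP = {
    "Heading 1": "title",
    "Body Text": "para",
    "List Bullet": "item",
}

def convert(paragraphs, root_tag="document"):
    """Map each (style, text) pair to a flat child element of the root.

    No target schema is consulted, so element nesting and validity
    constraints are not enforced; this mirrors the ad hoc mapping
    described in the text.
    """
    root = ET.Element(root_tag)
    for style, text in paragraphs:
        # Paragraphs with an unrecognized style fall back to a generic
        # element, so whatever their formatting conveyed is lost.
        tag = STYLE_MAP.get(style, "text")
        child = ET.SubElement(root, tag)
        child.text = text
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    sample = [
        ("Heading 1", "Lease Agreement"),
        ("Body Text", "This agreement is made between the parties."),
        ("Custom Clause", "Clause with unrecognized formatting."),
    ]
    print(convert(sample))
```

In this sketch the output is a flat sequence of elements; nothing checks whether that sequence is valid against a target schema, which is the limitation noted above.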
Conversion quality largely depends on the structural consistency of input documents, the availability and consistency of formatting, the sophistication of the conversion tool and the extent to which it is properly configured and optimized for processing of specific document types. The performance of prior solutions rarely has been satisfactory in practice. After the initial ‘batch’ processing, an operator or a content specialist usually needs to review the resulting XML document(s), manually fix structure inference errors and create any missing desired structure. Doing this typically involves using a specialized XML editing tool, which is independent from and not conveniently integrated with the conversion tool used in the first place. If it is found that poor conversion results are due to inconsistent or unexpected formatting or order of elements in the source unstructured document, either the document has to be modified to match the conversion rules and patterns or the latter have to be modified to account for the variability, and eventually the whole conversion-review-correction process has to be repeated. Even in a fully automated conversion process, human intervention is often unavoidable if semantically and structurally valid documents are the objective.
A need exists for the provision of quality support for conversion of unstructured documents to an XML-compatible structured form. To this end, it would be desirable to facilitate the entire conversion process (document analysis, definition of conversion rules and patterns, invocation of automatic parsing and markup generation, and subsequent review, correction and completion of results) within the GUI workspace of an XML-enabled generic wordprocessor such as Microsoft Word, which can be more efficient and convenient than the use of traditional RTF-to-XML converters in combination with standalone RTF viewers and XML editors. Further, it would be desirable to provide an integrated set of GUI tools for streamlined review of the conversion results and automatic identification of omissions and potential ‘trouble spots’ in the document. Another significant advantage of building document conversion functionality into an XML-enabled wordprocessor, as compared with other conversion frameworks, would be that all the original formatting and layout of the source document could be preserved, eliminating the need for manual re-formatting after XML markup is applied.
Two related additional problems associated with traditional converters are that 1) they ignore and subsequently lose significant formatting information and structural clues from the source document that are not explicitly recognized and/or somehow incorporated into the output XML data and 2) they separate (branch) the resultant XML document from the source unstructured document. These deficiencies are a consequence of the basic fact that existing conversion solutions build or convert to a new XML document from scratch and create element markup for source content ranges of only recognized formatting, while pure XML has no provisions for expressing formatting information. Therefore, ranges with unrecognized formatting get reduced to plain text in the output.
In a variety of initiatives involving streamlining of document-centric enterprise business processes, conversion to XML is not an end in itself. Rather, it should be viewed only as a means to enable automated processing of documents and execution of business logic based on the data contained in them, while humans continue to consume and update the content of their documents, desirably just as they did before XML was introduced into the process. The recent availability of XML-enabled generic wordprocessor applications (Microsoft Word 2003+, HyperVision's WorX for Word plug-in in conjunction with Word 2000+, Corel WordPerfect) creates the novel possibility of automatically applying XML-compatible markup to textual documents while keeping the documents' rich-text content intact and avoiding versioning and content synchronization problems, essentially by keeping the generated XML markup with the source data (and not having any other copies of the data at all). XML-aware domain-specific business applications could be built to operate on documents structured in this way as part of a continuous business process, without burdening users with the complexity of a specialized XML authoring tool. Preservation of the original layout (e.g., white space, pagination, line numbering and the like) is often desirable and, for many document types, especially legal documents, a crucial requirement. Such applications may also need the ability to have XML structure/markup applied to select document ranges only, not to the entire document at once. For example, blocks of unstructured data, such as customer addresses or standard contract clauses, may need to be imported from outside and then automatically structured in accordance with the XML schema associated with the document. In summary, providing all such automated XML structuring capabilities and benefits in the context of XML-enabled wordprocessor applications is among the objects of the present invention.