This disclosure relates generally to systems and methods for converting legacy documents into XML format.
Many companies and organizations that own data and documents in electronic form continually face the problem of migrating their legacy documents into new document formats that allow operations to be performed in the most cost-effective and efficient manner. This efficiency is obtained by sharing meta-information in the document. A standard formalism for encoding this meta-information and for data exchange is Extensible Markup Language (XML). The conversion process not only transforms legacy documents from an existing format into a new one, such as, for example, from Microsoft Word® into XML, but also customizes information which is not explicitly encoded in legacy documents.
XML is the modern industry standard for data exchange across service and enterprise boundaries. It is becoming increasingly common practice to use XML as the underlying data model for information organization, exchange and reuse. A large spectrum of activities around XML led by the World Wide Web Consortium (W3C), OASIS and others, including XML schemata, XML querying and transformation, has led to an increasing availability of data and documents in XML format and to the proliferation of domain-specific XML schema definitions using different schema formalisms (Document Type Definitions (DTDs), W3C XML Schema, RELAX NG), thus intensifying their exchange and increasing their reuse.
Migration to XML can be broken down into two important branches: data-oriented XML and document-oriented XML. Data-oriented XML refers to well-structured data and storage systems such as databases and transactional systems. Migration of this kind of data toward XML poses no serious problems, since the data is already well-structured and ready for machine-oriented processing. However, the problems associated with migrating documents toward XML are more serious and numerous. Documents of the type that often form corporate, group or personal knowledge bases are unstructured or semi-structured, i.e., they are stored in generic or specialized file systems, in a multitude of formats and forms. The prevailing majority of documents are created for humans, with a number of implicit assumptions and choices which are easy for a human reader to select, process and evaluate, but difficult and ambiguous for computer programs to process efficiently. The migration of documents toward XML thus addresses the transformation of the documents into a form that facilitates machine-oriented processing and reuse of documents, while making all assumptions explicit and reducing the ambiguity of choices where possible.
Much work has been done in the field of automatic transformation of complex documents. For example, U.S. Pat. No. 7,165,216 filed Jan. 14, 2004, to B. Chidlovskii et al. for “Systems and Methods for Converting Legacy and Proprietary Documents into Extended Mark-Up Language Format” (“A3128-US-NP”) treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model.
Transformation or conversion of legacy documents into XML documents may be seen as close to the wrapping of Web sources, since wrapping addresses the migration of semi-structured Web documents into the global target schema. However, while wrapping copes with the extraction of a small number of elements (10 or fewer) from simple and regular Web pages and presents them as plain tuples, conversion or transformation of other kinds of legacy documents in electronic formats must deal with hundreds of elements interrelated in a complex manner (guided by the target schema), and no single method (including wrapper writing or induction) would work well.