The present invention is directed to the field of electronic document format conversion. It finds particular application in the alignment of pairs of documents in different extensible markup language (XML) formats, and will be described with reference thereto, although it is to be appreciated that the method is also applicable to the alignment of documents in other formats.
Some of the benefits of electronic documents over paper documents include enhanced document processing capabilities and easier manipulation of documents, such as creation, editing, updating, storage, access, and delivery of documents. A key enabler for such enhancement in known systems is their ability to represent not only the contents of documents but also various meta-information about the contents. For instance, document structures, such as chapter, section, and paragraph breaks can be explicitly represented for enhanced browsing, retrieval, and component reuse.
Companies and organizations that own data and documents in electronic form frequently face the problem of migrating legacy documents, often in proprietary formats, into new document formats that allow such operations to be performed in the most cost-effective and efficient manner. This efficiency is obtained by sharing meta-information in the document. A standard formalism for encoding this meta-information and for data exchange is extensible markup language (XML). The conversion process has two main steps. The first main step involves design of a rich and highly structured document model. The second main step involves conversion of the legacy documents into the new document model. The conversion process not only transforms legacy documents from an existing format into a new one, such as, for example, from Microsoft Word™ into XML, but also customizes information which is not explicitly encoded in the legacy documents.
For Microsoft Word™ documents, for example, several conversion solutions exist. These conversion solutions use a proprietary model to save the document content along with all structural, layout and mark-up instructions. Although the document content is converted into a standard structured format, these solutions are often insufficient from a user's point of view, as they address not the document content with its associated semantics, but how the document content is to be visualized. As a result, the document structural tags are mark-up and/or layout oriented.
Schemas describe what types of nodes may appear in documents and which hierarchical relationships such nodes may have. A schema is typically represented by an extended context-free grammar. A tree is an instance of this schema if it is a parse tree of that grammar. In this regard, it should be noted that an extensible markup language (XML) schema specifies constraints on the structures and types of elements in an XML document. The basic schema language for XML is the DTD (Document Type Definition). Other XML schema definitions are also being developed, such as DCD (Document Content Definition), XSchema, etc. DTD uses a different syntax from XML, while DCD and XSchema specify an XML schema language in XML itself. They all describe XML schemas. This means that they assume the common XML structure and provide a description language to say how the elements are laid out and are related to each other.
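By way of illustration only, the notion of a tree being an instance of a schema can be sketched as follows. The sketch assumes a hypothetical schema in which each element name is associated with a regular expression over the sequence of its child element names, which is one common way of writing an extended context-free grammar. The element names (doc, chapter, section, title, para), the `SCHEMA` table, and the `conforms` function are assumptions made for this example and do not correspond to any particular DTD or schema language.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical schema: each element name maps to a regular expression over
# the sequence of its child element names (each name followed by a space).
SCHEMA = {
    "doc": r"(chapter )+",
    "chapter": r"title (section )*",
    "section": r"title (para )+",
    "title": r"",  # leaf: no child elements allowed
    "para": r"",   # leaf: no child elements allowed
}


def conforms(node, schema):
    """Return True if the subtree rooted at node is an instance of schema."""
    rule = schema.get(node.tag)
    if rule is None:
        # Element type not declared in the schema.
        return False
    # Encode the children as a string so the content model can be matched.
    children = "".join(child.tag + " " for child in node)
    if re.fullmatch(rule, children) is None:
        return False
    # Every child subtree must itself satisfy the schema.
    return all(conforms(child, schema) for child in node)


VALID = ET.fromstring(
    "<doc><chapter><title>A</title>"
    "<section><title>B</title><para>text</para></section>"
    "</chapter></doc>"
)
# The chapter below lacks its required title element.
INVALID = ET.fromstring(
    "<doc><chapter>"
    "<section><title>B</title><para>text</para></section>"
    "</chapter></doc>"
)

print(conforms(VALID, SCHEMA))
print(conforms(INVALID, SCHEMA))
```

In this sketch the first tree is accepted and the second is rejected, because the content model of the chapter element requires a title before any section.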
An important part of developing a system for automated conversion of documents from one format to another is the automatic learning of document transformations. During the supervised learning process, the leaves of the tree-structured source document are assigned target classes, which are obtained from given sample target documents. For the learning process it is important that a correspondence between the leaves of the source document and the leaves of the sample target document is established. This enables the learning method to assign a target class to the leaves in the source document. A suitable training set can thus only be constructed if it is known which target leaves correspond to which source leaves.
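By way of illustration only, the construction of such a training set can be sketched as follows. The sketch assumes, as a deliberate simplification, that a source leaf corresponds to a target leaf when their whitespace-normalized text is identical; exact text matching is used here only as a stand-in for an alignment method and is not asserted to be the method of the invention. The function names `leaves` and `label_source_leaves`, and the sample documents, are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict, deque


def leaves(root):
    """Return (tag, normalized text) for each childless element, in document order."""
    return [
        (el.tag, " ".join((el.text or "").split()))
        for el in root.iter()
        if len(el) == 0
    ]


def label_source_leaves(source_root, target_root):
    """Assign each source leaf the tag of a target leaf with identical text.

    Assumption for this sketch: correspondence means exact equality of
    normalized text. Source leaves with no counterpart receive None.
    """
    # Queue target classes per text so repeated texts are consumed in order.
    by_text = defaultdict(deque)
    for tag, text in leaves(target_root):
        by_text[text].append(tag)

    labeled = []
    for _source_tag, text in leaves(source_root):
        queue = by_text.get(text)
        target_class = queue.popleft() if queue else None
        labeled.append((text, target_class))
    return labeled


# Layout-oriented source document and semantically tagged sample target.
SOURCE = ET.fromstring(
    "<body><b>Intro</b><p>Some   text.</p><i>Footer</i></body>"
)
TARGET = ET.fromstring(
    "<doc><title>Intro</title><para>Some text.</para></doc>"
)

TRAINING_SET = label_source_leaves(SOURCE, TARGET)
for text, target_class in TRAINING_SET:
    print(repr(text), "->", target_class)
```

Source leaves that receive no target class, such as the footer in the sample above, would either be excluded from the training set or assigned a dedicated "no class" label, depending on the learning method used.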