1. Field of Invention
This invention relates to electronic document format conversion.
2. Description of Related Art
Many companies and organizations that own data and documents in electronic form continuously face the problem of migrating their legacy documents into new document formats that allow operations to be performed in the most cost-effective and efficient manner. This efficiency is obtained by sharing meta-information in the document. A standard formalism for encoding this meta-information and for data exchange is Extensible Markup Language (XML). The conversion process has two main steps. The first step involves design of a rich and highly structured document model. The second step involves conversion of legacy documents into the new document model. The conversion process not only transforms legacy documents from an existing format into a new one, such as, for example, from Microsoft Word® into XML, but also customizes information which is not explicitly encoded in legacy documents.
For Microsoft Word® documents, for example, several conversion solutions are available on the Internet and from Microsoft Corporation itself. These conversion solutions use a proprietary model to save the document content along with all structural, layout and mark-up instructions. Although the document content is converted into a standard structure format, this solution is often insufficient from a user's point of view, as it addresses not the document content with its associated semantics, but instead how the document content is to be visualized. As a result, the document structural tags are mark-up and/or layout oriented.
U.S. Pat. No. 5,491,628, the disclosure of which is hereby incorporated herein in its entirety by reference, discloses a method and apparatus for converting a first document to a second document where the first document is in a first extended attribute grammar, while the second document is in a second extended attribute grammar. An extended attribute coupling grammar couples the first and second extended attribute grammars. The first document is converted to a first tree, which is partially copied to a first copy. The first copy is completed by evaluating its attributes with respect to the extended attribute coupling grammar. The first copy is then a partially attributed tree of the second document. The partially attributed tree is completed to form a second tree based on the second extended attribute grammar. The second tree is then converted to the second document.
U.S. Pat. No. 5,915,259, the disclosure of which is hereby incorporated herein in its entirety by reference, discloses a computer-based method for providing the generation of schemas for output documents. An output schema representing a desired output condition of a document is created from inputs comprising a tree transformation rule defined by at least a pattern, a contextual condition, an input schema, and user specified parameters. A match-identifying tree automaton is created from the pattern, the contextual condition, and the input schema; and the match-identifying tree automaton is modified with respect to the user specified parameters.
U.S. Pat. No. 5,920,879, the disclosure of which is hereby incorporated herein in its entirety by reference, discloses a document structure conversion apparatus which performs processing for complementation and composes document structure according to a desired document class.
U.S. Pat. No. 5,970,490, the disclosure of which is hereby incorporated herein in its entirety by reference, discloses a method for processing heterogeneous data including high level specifications to drive program generation of information mediators, inclusion of structured file formats (also referred to as data interface languages) in a uniform manner with heterogeneous database schema, development of a uniform data description language across a wide range of data schemas and structured formats, and use of annotations to separate out from such specifications the heterogeneity and differences that heretofore have led to costly special purpose interfaces with emphasis on self-description of information mediators and other software modules. The method involves inputting a high level transformation rule specification.
U.S. Pat. No. 6,480,865, the disclosure of which is hereby incorporated herein in its entirety by reference, discloses a method for annotating eXtensible Markup Language (XML) documents with dynamic functionality. The dynamic functionality comprises invocations of Java objects. These annotations belong to a different name space, and thus a Dynamic XML-Java (DXMLJ) processor recognizes elements within the XML document that are tagged with DXMLJ prefix tags, processes each of these tags, and transforms the XML document accordingly.
U.S. Pat. No. 6,487,566, the disclosure of which is hereby incorporated herein in its entirety by reference, discloses a system for specifying transformation rules of Extensible Markup Language (XML) documents into other XML documents, wherein the rule language used is XML itself. The transformation rule specifications identify one or more transformations of the document to be performed when a pattern match occurs between the document and a source pattern. The specifications are used to define class specifications for objects that perform the transformations. The invention provides a pattern matching language, known as PML, that performs pattern matching and replacement functions for transforming any XML instance to any other XML instance. The PML pattern language is comprised of a sequence of rules expressed in XML, wherein each rule has four main components: (1) a source pattern (pat); (2) a condition (cond); (3) a target pattern (tgt); and (4) an action part (action).
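The four-part rule structure described above (source pattern, condition, target pattern, action) can be illustrated with a minimal sketch. This is not PML syntax and is not taken from the referenced patent; the `Rule` class, the `apply_rules` function, and the simplification of patterns to bare element tags are all assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """Hypothetical four-part rule; the field names mirror the components listed above."""
    pat: str                               # source pattern (assumed here: an element tag)
    cond: Callable[[ET.Element], bool]     # condition tested on a matched element
    tgt: str                               # target pattern (assumed here: the new tag)
    action: Callable[[ET.Element], None]   # extra action run on the rewritten element


def apply_rules(root: ET.Element, rules: List[Rule]) -> ET.Element:
    """Rewrite, in place, each element matched by the first applicable rule."""
    for elem in root.iter():
        for rule in rules:
            if elem.tag == rule.pat and rule.cond(elem):
                elem.tag = rule.tgt
                rule.action(elem)
                break  # at most one rule fires per element
    return root


# Usage: turn non-empty <b> elements into <emphasis role="strong">.
rules = [
    Rule(
        pat="b",
        cond=lambda e: bool(e.text),
        tgt="emphasis",
        action=lambda e: e.set("role", "strong"),
    )
]
doc = apply_rules(ET.fromstring("<p>Hello <b>world</b><b/></p>"), rules)
```

In this sketch the empty `<b/>` element is left untouched because its condition fails, which shows how the condition component narrows a pattern match.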
U.S. Pat. No. 6,569,207, the disclosure of which is hereby incorporated herein in its entirety by reference, discloses a system for generating class specifications from Extensible Markup Language (XML) schemas and then instantiating objects from those class specifications using data contained in XML documents.
Schemas describe what types of nodes may appear in documents and which hierarchical relationships such nodes may have. A schema is typically represented by an extended context-free grammar. A tree is an instance of this schema if it is a parse tree of that grammar.
In this regard, it should be noted that an Extensible Markup Language (XML) schema specifies constraints on the structures and types of elements in an XML document. The basic schema for XML is the DTD (Document Type Definition). Other XML schema definitions are also being developed, such as DCD (Document Content Definition), XSchema, etc. The main difference between DTD and DCD is that DTD uses a different syntax from XML, while DCD specifies an XML schema language in XML itself. XSchema is similar to DCD in this respect. In spite of the differences in syntax, the goals and constraint semantics of all these XML schema languages are the same. Their commonality is that they all describe XML schemas. This means that they assume the common XML structure, and provide a description language to say how the elements are laid out and how they are related to each other.
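The view of a schema as an extended context-free grammar, with a tree being an instance of the schema when it is a parse tree of that grammar, can be illustrated with a minimal sketch. The schema below, its element names, the DTD-like content-model notation, and the functions `compile_model` and `is_instance` are hypothetical and chosen only for illustration; trees are represented as `(label, children)` pairs and text content is ignored.

```python
import re

# Hypothetical schema: each node type maps to a content model, i.e. a regular
# expression over child node types (",": sequence, "|": choice, "*", "+", "?").
SCHEMA = {
    "doc": "title,section+",
    "section": "title,(para|list)*",
    "title": "",
    "para": "",
    "list": "",
}


def compile_model(model):
    """Turn a content model into a regex over space-terminated child labels."""
    pattern = re.sub(r"[A-Za-z_]\w*", lambda m: "(?:%s )" % m.group(0), model)
    return re.compile(pattern.replace(",", ""))


def is_instance(node, schema, root=None):
    """Return True if the (label, children) tree is a parse tree of the schema."""
    label, children = node
    if root is not None and label != root:
        return False
    if label not in schema:
        return False
    # Encode the sequence of child labels, each followed by a space.
    child_seq = "".join(child[0] + " " for child in children)
    if not compile_model(schema[label]).fullmatch(child_seq):
        return False
    return all(is_instance(child, schema) for child in children)


good = ("doc", [("title", []),
                ("section", [("title", []), ("para", []), ("list", [])])])
bad = ("doc", [("section", [("title", [])])])  # the required title is missing

print(is_instance(good, SCHEMA, root="doc"))
print(is_instance(bad, SCHEMA, root="doc"))
```

Terminating every label with a space keeps a label such as `title` from matching a longer label that merely starts with the same characters.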
Extracting information from documents, including Internet website documents, involves dealing with a diversity of information indexing schemes and formats. Analyzing the results returned by search engines and putting the parsed answers into a unified format, including storing the extracted information within a unified format, is known as “wrapping” the information.
Information extraction may use different approaches. One approach, called the local view approach, involves information extraction from unstructured and semi-structured text, wherein a wrapper is an enhancement of a basic HTML parser with a set of extraction rules. See, in this regard, D. Freitag, “Information extraction from HTML: Application of a general machine learning approach,” Proc. AAAI/IAAI, pp. 517-523, 1998. An extraction rule often takes the form of delimiters, or landmarks, that are sequences of tags preceding, or following, an element to be extracted. For example, the delimiter <td> <a> requires a text token to be preceded by the tags <td> and <a>. Because the local view approach uses local delimiters in a context-less manner, the expressive power of delimiter-based wrappers is limited.
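The delimiter example above can be illustrated with a minimal sketch built on the `html.parser` module of the Python standard library. The class `DelimiterWrapper`, the function `extract`, and the sample page are hypothetical; the sketch handles only preceding delimiters and is not the method of the cited reference.

```python
from html.parser import HTMLParser


class DelimiterWrapper(HTMLParser):
    """Collects text tokens immediately preceded by a given sequence of opening tags."""

    def __init__(self, delimiter):
        super().__init__()
        self.delimiter = list(delimiter)
        self.recent_tags = []  # opening tags seen since the last text token or closing tag
        self.extracted = []

    def handle_starttag(self, tag, attrs):
        self.recent_tags.append(tag)

    def handle_endtag(self, tag):
        self.recent_tags = []

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # whitespace between tags does not interrupt a delimiter
        if self.recent_tags[-len(self.delimiter):] == self.delimiter:
            self.extracted.append(text)
        self.recent_tags = []


def extract(html, delimiter=("td", "a")):
    """Return the text tokens preceded by the delimiter tags, in document order."""
    wrapper = DelimiterWrapper(delimiter)
    wrapper.feed(html)
    wrapper.close()
    return wrapper.extracted


page = (
    '<table>'
    '<tr><td><a href="/p1">Widget</a></td><td>9.99</td></tr>'
    '<tr><td><a href="/p2">Gadget</a></td><td>4.50</td></tr>'
    '</table>'
    '<table><tr><td><a href="/home">Home</a></td></tr></table>'
)
print(extract(page))
```

The rule also picks up the link in the second table, since it inspects only the tags adjacent to the text token and not the surrounding context; this is the limitation of context-less local delimiters noted above.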
The global view approach assumes that HTML pages are instances of an unknown language and attempts to identify this language. The global view benefits from grammatical inference methods that can learn finite-state transducers and automata from positive examples, although these methods often require many annotated examples to achieve a reasonable generalization.