1. Field of the Invention
The present invention generally relates to the area of document processing and electronic publishing systems, and more particularly relates to a method and apparatus for generating structured documents with user-defined document type definitions. The present invention also relates to a mechanism that allows users to convert unstructured documents for various presentations using the method and apparatus, wherein the unstructured documents are defined to be files composed, edited, or managed via an authoring application (e.g. a word processing application).
2. Description of the Related Art
The Internet is a rapidly growing communication network of interconnected computers around the world. Together, these millions of connected computers form a vast repository of hyperlinked information that is readily accessible by any of the connected computers from anywhere at any time. With millions of web pages being created and added to this vast repository each year, there is a tremendous need to quickly and easily convert documents, such as presentations, data sheets or brochures, into a format presentable to and accessible by other applications or computers on the Internet.
It is well known that a preferred format for presentation in a web browsing application (e.g. a browser) is a markup language, such as HyperText Markup Language (HTML), Extensible Markup Language (XML), Standard Generalized Markup Language (SGML) or Wireless Markup Language (WML). Files or documents that are so composed, edited or managed for web browsing applications are commonly referred to as structured files or documents. Among the benefits of structured documents, the ability to provide user-defined document type definitions (DTDs) or document schema definitions opens a new paradigm for information exchange or storage. However, the challenge is how to generate structured documents with arbitrary user-defined DTDs.
A structured document with a specific DTD can either be created from an unstructured document or converted from a structured document with another type of DTD. There are several editors for generating structured documents. Exemplary editors include Adobe FrameMaker, Arbortext Epic, and SoftQuad XMetal. These editors usually provide a structural view along with a word processing view, where the word processing view resembles the traditional word processing environment for unstructured documents while the structural view contains the document structure of data elements defined in a certain DTD. To create a structured document from scratch in these editors, a user usually needs to create an unstructured document in the word processing view. With a desired DTD loaded, the user constructs a document structure tree in the structural view in accordance with the document elements defined in the DTD. Typically, the user does so by copying and pasting or dragging and dropping the data elements from the created document into the document structure tree.
To convert a structured document from one DTD to another in these editors, one needs to load the structured document, modify the tags and attributes of the document elements from one DTD to the other, and rearrange the data elements or parse new data elements associated with the redefined document elements in the new DTD.
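For illustration only, the tag and attribute modification step of such a conversion can be sketched in Python with the standard xml.etree.ElementTree module. The element names, the attribute names, and the TAG_MAP and ATTR_MAP tables are hypothetical and are not taken from any particular DTD or editor. The sketch does not rearrange or re-parse data elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from element names in a source DTD to a target DTD.
TAG_MAP = {"doc": "article", "heading": "title", "para": "p"}

# Hypothetical attribute renames, keyed by the source element name.
ATTR_MAP = {"para": {"kind": "class"}}


def convert(xml_text: str) -> str:
    """Rename tags and attributes of a document from one vocabulary to another."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        source_tag = elem.tag
        # Rename attributes first, while the source tag name is still known.
        for old_name, new_name in ATTR_MAP.get(source_tag, {}).items():
            if old_name in elem.attrib:
                elem.set(new_name, elem.attrib.pop(old_name))
        # Elements without a mapping keep their original name.
        elem.tag = TAG_MAP.get(source_tag, source_tag)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    source = '<doc><heading>T</heading><para kind="intro">Body</para></doc>'
    print(convert(source))
```

Even in this simplified form, every element and attribute of the new DTD requires a manually authored mapping entry, which reflects the effort involved in such conversions.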
Among the procedures described above, the association between data elements and document elements is a crucial and labor-intensive step in creating a structured document with a specific DTD, whether from an unstructured document or from another structured document. Several approaches have been proposed to associate the data elements with the document elements and thereby simplify the generation of the structured document. For example, a keyword extracting approach extracts keywords representative of the document structure from an unstructured document, and the keyword/text pairs are used as the association between document elements and data elements. A coordinate approach associates data elements with markup language tags in document elements by sorting the coordinates for coordinate documents. A logical structure approach analyzes the document structure by matching predetermined patterns and parses the data elements based on the analyzed document elements. Nevertheless, none of the above approaches considers using identifiers (e.g. font information) to associate the data elements with the document elements. There is, therefore, a need for a generic approach that uses identifier information in user-defined document type definitions to associate data elements with document elements for generating structured documents.
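As a rough illustration of what associating data elements with document elements through an identifier could look like, the following Python sketch maps text runs to element names by their font information. It is a minimal sketch under stated assumptions, not the method of the invention: the Run type, the IDENTIFIERS table, the element names, and the choice of a (font, size) pair as the identifier are all hypothetical.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Run:
    """A piece of text from an unstructured document with its font information."""
    text: str
    font: str
    size: int


# Hypothetical identifier table: (font name, point size) -> document element name.
IDENTIFIERS = {
    ("Arial", 18): "title",
    ("Arial", 14): "section",
    ("Times", 11): "para",
}


def build(runs, root_name="document"):
    """Wrap each run in the element associated with its font identifier."""
    root = ET.Element(root_name)
    for run in runs:
        name = IDENTIFIERS.get((run.font, run.size))
        if name is None:
            continue  # No identifier matches; this sketch simply skips the run.
        child = ET.SubElement(root, name)
        child.text = run.text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    runs = [
        Run("Report", "Arial", 18),
        Run("Intro", "Arial", 14),
        Run("Hello.", "Times", 11),
    ]
    print(build(runs))
```

The sketch produces a flat sequence of elements. Nesting the elements according to a DTD would require additional rules that are not shown here.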
In addition, the procedures required by the exemplary editors are somewhat tedious and laborious and can be inherently costly. Quite often, a business that has many documents to convert has to outsource the process due to the inefficiency and slowness associated with the conversion process. On the other hand, the conversion process conducted by a service provider is difficult to quantify because it consists mainly of manual and repetitive steps that depend on the complexity of the documents. There is thus another need for a mechanism for quantifying the conversion of unstructured documents to structured documents for various presentations in a cost-determinable way.