Document collections are generated and maintained by businesses, governments, and other organizations. Such collections are typically accessible via the Internet, a local computer network, or the like. Documents are created in diverse formats, such as word processing formats, spreadsheet formats, the Adobe Portable Document Format (PDF), the Hypertext Markup Language (HTML), and so forth.
Conversion of documents in these diverse formats to a common structured format has certain advantages. PDF and HTML are platform-independent formats that enable exchange of documents across computing platforms. However, PDF and HTML are less effective at facilitating document reuse or repurposing. HTML markup tags, for example, are not well suited to expressing complex tree structures for organizing a document. Moreover, neither PDF nor HTML imposes stringent requirements on document structure. For example, a valid HTML document can have missing closing tags and other formatting deficiencies.
The Extensible Markup Language (XML) allows a document to incorporate a document type definition (DTD) that imposes stringent document formatting requirements. The DTD also supports complex nesting or tree structures. Thus, XML is becoming recognized as a common document format suitable for document reuse, repurposing, and exchange. In addition to XML, other structured formats that include a structuring schema or other explicit organization can be used to provide a common structured document format.
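The contrast between lenient HTML and strict XML can be illustrated with a minimal sketch in Python, using only the standard library. The snippet, the TagCollector class, and the helper function names are hypothetical and chosen for illustration only. The sketch checks XML well-formedness, not validity against a DTD; enforcing a DTD would require a validating parser, which is assumed to be outside the scope of this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical list fragment in which the </li> closing tags are omitted.
SNIPPET = "<ul><li>first<li>second</ul>"


class TagCollector(HTMLParser):
    """Records the start and end tags seen while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


def parse_as_html(text):
    """Parse text as HTML and return the start and end tags encountered."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.start_tags, collector.end_tags


def is_well_formed_xml(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # The HTML parser accepts the fragment without complaint.
    print(parse_as_html(SNIPPET))
    # The XML parser rejects the same fragment because the tags do not nest.
    print(is_well_formed_xml(SNIPPET))
    # With every element closed, the fragment is well-formed XML.
    print(is_well_formed_xml("<ul><li>first</li><li>second</li></ul>"))
```

In this sketch the HTML parser reports three start tags and a single end tag without raising an error, whereas the XML parser accepts the fragment only once every element is explicitly closed.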
Accordingly, there is strong motivation to provide robust and flexible conversion tools for converting word processing, spreadsheet, PDF, HTML, and other types of documents to XML or another common structured format. However, existing conversion tools typically make strong assumptions about source document structure, which limits their applicability to a small subset of documents.