Unstructured computer document files (e.g., documents formatted as word processing files) are converted to a structured format (e.g., an extensible mark-up language (XML) format) in order to provide tracking, meta-data management, a table of contents, word and phrase searching and retrieval, scaling, and the like. As a result, for example, books can be converted to e-Book file formats, and a company's reports and documents can be more easily managed, served over the World Wide Web, searched, organized, and viewed one page at a time. Thus, there has long been a need to convert unstructured document files to a structured format.
In the prior art, individual documents were first converted to the XML format manually and at great cost, because the manual conversion process was lengthy and required a technician both skilled in XML and knowledgeable in the subject matter of the document. There was also an attempt to have people draft documents according to the XML format in the first instance, an attempt which understandably failed.
Next, those skilled in the art attempted to design computer-based conversion engines intended to automatically convert “Word” files to, for example, the XML format. In practice, however, the conversion engines produced inaccurate, low-quality output and, as a result, manual labor was still required to correct the output of the conversion engine for quality assurance purposes and to ensure that the XML representation of the original file was correct. A technician would directly compare the original “Word” document, for example, with the XML representation of the document output by the conversion engine, resulting in added costs and long conversion cycles.
Accordingly, until now, there was no adequate system or method for converting unstructured documents to a structured format.