The exemplary embodiment relates to the field of electronic document processing. It finds particular application in connection with electronic document format conversion, especially the processing of documents stored in an unstructured or semi-structured format, and will be described with particular reference thereto. However, it is to be appreciated that the following is amenable to other like applications.
Organizations frequently have documents that are stored in an unstructured or semi-structured format that is difficult to reformat for different viewing devices. A common task is the batch conversion of these documents into an electronic form which allows searching and automatic transformation for presentation by different devices. Legacy documents are frequently either stored as Adobe Portable Document Format (PDF) files or scanned from hard copies into PDF. Other common formats are image formats such as Portable Network Graphics (PNG), Graphics Interchange Format (GIF), and the like. Legacy documents may also be in word processing formats or other, possibly proprietary, formats. The target formats are often XML, SGML, or HTML, which allow easy conversion into other structured formats, e.g., the EPUB format for e-book readers. Reformatting into a structured document may entail segmenting the document by finding paragraph divisions and generating a table of contents, information that is not readily available from unstructured scanned documents or PDF documents.
When large quantities of such documents are to be processed, batch processing by an outside service provider may be desirable. If the documents are confidential in nature, however, there may be concerns that sensitive information could be released, either during transmission or by the service provider. An encrypted channel may be used to protect the sensitive information during transmission, but this does not remove the risk of disclosure by the service provider once the documents have been decrypted.
It would be desirable to have a method and system for transmitting a document such that a service provider may perform processing of the structure of the document and limited processing of the content without having full access to the content.
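By way of illustration only, the following minimal Python sketch suggests how structure can remain available for processing while the content itself is withheld. It is not the method of the exemplary embodiment. The function name mask_content and the placeholder characters are assumptions made for this example: letters and digits are replaced by placeholders, while whitespace and punctuation, which carry cues such as paragraph breaks and heading numbering, are left untouched.

```python
def mask_content(text: str) -> str:
    """Replace letters and digits with placeholders, keeping layout intact.

    Hypothetical helper for illustration; not taken from the source document.
    """
    masked = []
    for ch in text:
        if ch.isdigit():
            masked.append("9")            # digits become a digit placeholder
        elif ch.isalpha():
            # preserve capitalization so heading-like lines stay recognizable
            masked.append("X" if ch.isupper() else "x")
        else:
            masked.append(ch)             # whitespace and punctuation unchanged
    return "".join(masked)


if __name__ == "__main__":
    sample = "1. Intro\n\nSee p. 42."
    print(mask_content(sample))
```

Simple character masking of this kind still reveals word lengths and capitalization, so it should be read as a way of picturing the goal rather than as an adequate protection scheme.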