Document analysis systems analyse an image of a document to identify and extract content-based information from the various regions which collectively form the document. Typically, document analysis systems identify each of the regions of the document and classify each region as either text or non-text. This form of content-based analysis is an important precursor to document management, synthesis and processing for display.
The proliferation of embedded document analysis on scanning devices and of software-based document analysis systems has made conversion of a scanned document to a high-level document, such as PDF, Word™, PowerPoint™ and Excel™, increasingly common. As the accuracy of content extraction improves, the motivation for scan-to-high-level-document processing has gradually shifted from archiving to content reuse. Text content reuse is well established due to the prevalence of optical character recognition (OCR), while applications that perform non-text content reuse are still uncommon and are often fairly limited.
One type of incorrect interpretation of a scanned document may be caused by relative colour assignment among extracted non-text objects. Typically, only a small set of colours is used in the original document to represent the non-text objects. However, due to halftoning introduced during the printing process, the printed document may differ in colour from the original. As a result, a high-level document generated from a scan of the printed document will often show a noticeable colour discrepancy from the original document when the extracted non-text objects are compared. The most noticeable colour discrepancy is caused by a loss of highlight shading. For example, a table often has text with light shading to highlight column headings. This shading is often removed from the extracted table during the document analysis process found in existing conversion tools. A second noticeable colour discrepancy is that objects of the same colour in the original document may have different colours in the high-level document.
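The effect of halftoning on measured colour can be illustrated with a minimal sketch. It assumes a 4x4 ordered-dither (Bayer) halftone and an idealised scan with no blur or misregistration; actual printers and scanners behave differently, and the names `BAYER_4X4`, `halftone` and `mean_level` are hypothetical helpers introduced only for this illustration. The sketch renders a flat light grey (level 230 on a 0 to 255 scale) as a pattern of pure black and white dots and then measures the average level over two regions of different size.

```python
# Illustrative only: a 4x4 Bayer threshold matrix (ordered dithering),
# assumed here as a stand-in for a printer's halftone screen.
BAYER_4X4 = [
    [0, 8, 2, 10],
    [12, 4, 14, 6],
    [3, 11, 1, 9],
    [15, 7, 13, 5],
]


def halftone(gray, width, height):
    """Render a flat grey level (0 = black, 255 = white) as a 1-bit dot pattern."""
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            # Map matrix entries 0..15 to thresholds 8, 24, ..., 248.
            threshold = (BAYER_4X4[y % 4][x % 4] + 0.5) * 16
            row.append(255 if gray > threshold else 0)
        rows.append(row)
    return rows


def mean_level(pixels):
    """Average level over a region, as a naive colour estimate for that region."""
    total = sum(sum(row) for row in pixels)
    count = sum(len(row) for row in pixels)
    return total / count


if __name__ == "__main__":
    original = 230  # light highlight shading in the original document
    region_a = halftone(original, 8, 8)  # covers whole halftone cells
    region_b = halftone(original, 6, 6)  # covers partial halftone cells
    print(mean_level(region_a))
    print(mean_level(region_b))
```

Under these assumptions, every pixel in the halftoned region is either black or white, so a per-pixel analysis sees a mostly white area with sparse dots rather than light shading. The two regions also yield different average levels even though both come from the same original colour, which mirrors the second discrepancy described above.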
These incorrect interpretations limit the scope for reuse of the extracted non-text content. As a result, there is a need to alleviate the aforementioned problems in a document analysis system for scan-to-high-level-document applications.