The proliferation of imaging technology, combined with ever-increasing computational processing power, has led to many advances in the area of document analysis. Document analysis systems may be used to extract semantic information from a scanned document. Such systems analyse images of documents to identify and extract content-based information from the various regions which collectively form the document. Typically, document analysis systems identify each of the various regions and classify the regions as text or non-text. This form of content-based analysis is an important precursor to document management, synthesis and display processing. The results of this analysis can then be passed to further processing stages, which may perform tasks such as optical character recognition (OCR) on the text regions, compression or enhancement of the image regions, extraction of non-text regions for reuse, and content-based compression to reduce document image size.
One problem in document analysis is the accurate identification of text and non-text regions in images of a scanned document, particularly images that include complex features such as photographs and line graphics. Popular applications of content analysis in a document management system include the generation of editable text, via optical character recognition (OCR), and the extraction of photographs and line graphics for later reuse.
The advent of powerful document editing tools and the proliferation of inexpensive colour printing systems have enabled the creation of complex colour documents, whose layouts are no longer restricted to the traditional rectangular layout style. Text and non-text often overlap without any geometric constraints, and unpredictable colour combinations are often used. Document analysis systems often employ a binarisation pre-processing step to reduce the amount of information and so simplify the page layout analysis tasks. A popular method of extracting graphics regions in such systems is to look for large black connected components (CCs) with low black density in the binarised image. However, because only two colours are available in the analysis stages, binarisation-based document analysis systems often fail to handle some text and non-text colour combinations.
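By way of illustration only, the binarisation-based search for large, low-density black connected components described above may be sketched as follows. The function names, the 4-connectivity, the use of the bounding box to measure size and black density, and the values of `threshold`, `min_box_area` and `max_density` are all assumptions made for this sketch and are not taken from any particular known system.

```python
from collections import deque


def binarise(gray, threshold=128):
    """Map 0-255 grey values to 1 (black) or 0 (white)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def connected_components(binary):
    """Return each 4-connected black component as a list of (row, col) pixels."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for row in range(height):
        for col in range(width):
            if binary[row][col] == 0 or seen[row][col]:
                continue
            # Breadth-first flood fill from an unvisited black pixel.
            seen[row][col] = True
            queue = deque([(row, col)])
            pixels = []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and binary[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            components.append(pixels)
    return components


def graphics_candidates(gray, min_box_area=16, max_density=0.5, threshold=128):
    """Keep components whose bounding box is large but sparsely filled.

    The size and density limits are illustrative assumptions only.
    """
    candidates = []
    for pixels in connected_components(binarise(gray, threshold)):
        rows = [r for r, _ in pixels]
        cols = [c for _, c in pixels]
        top, left, bottom, right = min(rows), min(cols), max(rows), max(cols)
        box_area = (bottom - top + 1) * (right - left + 1)
        density = len(pixels) / box_area  # fraction of the box that is black
        if box_area >= min_box_area and density <= max_density:
            candidates.append({"box": (top, left, bottom, right),
                               "density": density})
    return candidates
```

In this sketch a thin rectangular frame is retained as a graphics candidate because its bounding box is mostly white, whereas a small solid blob is rejected. Because every pixel has already been reduced to black or white, any two colours falling on the same side of the threshold cannot be told apart by the later stages.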
There are two major processing stages in any document analysis system: (i) segmenting a page into homogeneous regions, and (ii) classifying each homogeneous region into one of a set of predefined (content) classes. The choice of the number of classes depends on the purpose of the application. A document analysis system that only needs to extract text from a page may only require text and non-text classifications. On the other hand, a document analysis system that is targeted at document compression may have two classes, lossy and lossless, coupled to the choice of compression methods. The lossy regions may include photograph and background and the lossless regions may include text, graphics and background. A more complicated document analysis system may even have four classes, such as text, graphics, photo and background. Due to the growing complexity of document layouts and unpredictable colour combinations, it is becoming increasingly difficult to define a clear boundary between the two processing stages. Moreover, segmenting a page into homogeneous regions is virtually impossible for documents with mixed content such as text embedded in graphics or photos.
For documents that need to have text, graphics and photographs (“photo”) extracted separately, precision in content geometry and accuracy in content type classification are equally important. Cascaded classification methods are often used to meet these two competing requirements. The most common cascaded approach first classifies a document image into text and non-text regions, and then classifies each non-text region as graphics or photo. The accuracy of graphics extraction therefore has a compound dependency on the capability and accuracy of the two cascaded stages.
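The compound dependency of a two-stage cascade may be illustrated with the following minimal sketch. The region features (`looks_like_text`, `num_colours`), the two stage predicates and the example rates are hypothetical, and the multiplication of the stage rates assumes that the errors of the two stages are independent, which need not hold in practice.

```python
def cascade_classify(region, is_text, is_graphics):
    """Two-stage cascade: text vs non-text, then graphics vs photo."""
    if is_text(region):
        return "text"
    # Only regions passed on by stage one are ever seen by stage two.
    return "graphics" if is_graphics(region) else "photo"


def compound_graphics_rate(stage1_rate, stage2_rate):
    """Fraction of true graphics regions that survive both stages.

    stage1_rate: fraction of graphics regions correctly passed on as non-text.
    stage2_rate: fraction of those then correctly labelled as graphics.
    Assumes the two stages err independently (an illustrative assumption).
    """
    return stage1_rate * stage2_rate


# Hypothetical stage predicates operating on hypothetical region features.
def toy_is_text(region):
    return region["looks_like_text"]


def toy_is_graphics(region):
    return region["num_colours"] <= 8


logo = {"looks_like_text": False, "num_colours": 3}
mislabelled_logo = {"looks_like_text": True, "num_colours": 3}
print(cascade_classify(logo, toy_is_text, toy_is_graphics))
print(cascade_classify(mislabelled_logo, toy_is_text, toy_is_graphics))
print(compound_graphics_rate(0.9, 0.9))
```

Under these assumptions, two stages that are each correct for 90% of graphics regions yield only about 81% of graphics regions overall, and a graphics region wrongly labelled as text in the first stage can never be recovered by the second.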
One known method uses edge features to extract flat regions such as text and graphics. The extracted flat regions are represented as overlaid single coloured layers with the non-extracted regions as a background image. This method is oriented more towards image compression than document analysis, and hence does not extract graphics objects explicitly.
Another known method of classifying a rectangular non-text region uses various features extracted from the whole region, such as texture, colour histogram and edge statistics, to determine whether the region should be classified as a picture or as graphics. This method assumes that a rectangular non-text region has already been extracted for further analysis, but does not consider the possibility of text being contained in the extracted rectangular non-text region. Moreover, extracting non-rectangular non-text regions is, by itself, a challenging problem.
Another known method of classifying a document image into non-rectangular text and non-text regions builds a texture vector for each pixel of the document image using M-band wavelet filters. A K-means clustering method is then used to classify each pixel as either text or non-text based on the extracted texture vectors. Although this method is able to classify a document image into non-rectangular text and non-text regions, it is silent on how to classify a non-rectangular non-text region into graphics or photo. Moreover, the method demonstrates that the segmented text regions overlap with photos when the document image has text embedded in photos.
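The clustering step of such a method may be sketched as below. This is a plain two-cluster K-means over per-pixel feature vectors that are assumed to have been computed already; the M-band wavelet filtering itself is not reproduced, and the deterministic choice of initial centres is an assumption made so that the sketch is repeatable.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def kmeans_two_clusters(vectors, iterations=20):
    """Cluster feature vectors into two groups; return (labels, centres)."""
    # Assumed initialisation: the first vector and the vector farthest from it.
    first = vectors[0]
    farthest = max(vectors, key=lambda v: squared_distance(v, first))
    centres = [list(first), list(farthest)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centre.
        labels = [
            0 if squared_distance(v, centres[0]) <= squared_distance(v, centres[1])
            else 1
            for v in vectors
        ]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for k in (0, 1):
            members = [v for v, label in zip(vectors, labels) if label == k]
            new_centres.append(mean_vector(members) if members else centres[k])
        if new_centres == centres:
            break  # converged
        centres = new_centres
    return labels, centres


# Hypothetical two-dimensional texture vectors, one per pixel.
texture_vectors = [[0.9, 0.8], [1.0, 0.9], [0.1, 0.2], [0.0, 0.1], [0.95, 0.85]]
labels, centres = kmeans_two_clusters(texture_vectors)
print(labels)
```

The clustering only separates the pixels into two groups. Deciding which group corresponds to text, and how a non-text group should be further divided into graphics and photo, requires additional steps that the clustering itself does not provide.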
There is a need to accurately extract graphics regions from a colour document image with mixed contents and complex layouts.
Due to unpredictable colour combinations and potentially unrestrained intermingling of text and non-text regions present in such documents, a problem facing document analysis systems is the increase of image reproduction artefacts such as colour bleeding and noise. Such colour pollution may have been caused by printer anti-aliasing, scanner aliasing, halftone estimation, chromatic aberrations, and other blurring, haloing or fringing effects. The increase in layout complexity and colour reproduction errors consequently makes accurate extraction of content from colour documents significantly more challenging than for traditional black and white documents.
A basic approach to determining page content operates on a binary (black and white) version of the input scan image of the document, in order to avoid colour complexity. The conventional methods for text extraction in black and white documents are considered mature, and may still produce acceptable results for documents with limited colour combinations and rectangular layout styles. However, those methods do not address the problems associated with image reproduction errors or complex colour combinations. This binarisation approach fundamentally relies on applying a threshold to the input image, and may fail to correctly distinguish between foreground regions such as text, and the surrounding background pixels.
An alternative method is to process the colour document directly, often utilising a quantised colour version of the input to reduce processing complexity and remove several reproduction artefacts, such as halftone colours. Additionally, a method utilising colour quantisation can often correctly distinguish between colours that binarisation fails to separate. However, colour quantisation in and of itself cannot solve colour misrepresentation errors present in the source document, such as colour bleeding, which may be retained by the quantisation process. This could result in information-bearing objects that have colour fringing effects, or that have been quantised into multiple colours instead of remaining as flat-filled objects. Text decoration effects such as outline and shadow will also be retained by the colour segmentation process, which poses challenges for character recognition methods if both the text and decoration components are to be extracted.
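The contrast between thresholding and colour quantisation may be illustrated with the following sketch. The luma weights are the widely used ITU-R BT.601 coefficients; the threshold of 128, the uniform quantisation to two bits per channel, and the three example colours are assumptions chosen purely for illustration.

```python
def luma(rgb):
    """Approximate brightness of an 8-bit RGB colour (BT.601 weights)."""
    red, green, blue = rgb
    return 0.299 * red + 0.587 * green + 0.114 * blue


def binarise_pixel(rgb, threshold=128):
    """Return 1 (black) if the colour is darker than the threshold, else 0."""
    return 1 if luma(rgb) < threshold else 0


def quantise_pixel(rgb, bits=2):
    """Uniformly quantise each 8-bit channel to the given number of bits."""
    shift = 8 - bits
    return tuple(channel >> shift for channel in rgb)


text_colour = (255, 0, 0)      # red text
background = (0, 0, 255)       # blue background
fringe = (128, 0, 128)         # bleeding at the boundary between the two

pixels = [text_colour, background, fringe]
binary_values = [binarise_pixel(p) for p in pixels]
quantised_values = [quantise_pixel(p) for p in pixels]
print(binary_values)
print(quantised_values)
```

In this example all three colours fall below the threshold, so binarisation merges the text with its background. Quantisation keeps the text and background distinct, but the fringe colour is also retained as a third colour rather than being absorbed into either neighbour, which corresponds to the colour bleeding problem noted above.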
There is a need to accurately identify and categorise the information-bearing objects contained within mixed-content document images possessing complex colour layouts.