The proliferation of scanning technology, combined with ever-increasing computational processing power, has led to many advances in the area of document analysis. Document analysis systems may be used to extract semantic information from a scanned document, often by means of optical character recognition (OCR) technology. The scanning of a document results in the formation of a single-layered image. Document analysis is used in a growing number of applications, such as automated form reading. Such systems can also be used to improve compression of a document by selectively using a compression method appropriate to the particular content of each part of a page of the document. Improved document compression lends itself to applications such as archiving and electronic distribution.
Document analysis can typically be broken into three stages. The first of these stages involves pixel colour analysis. The second stage is document layout analysis, which identifies content types such as text, backgrounds and images in different regions of the page. The final stage uses the results of the first two stages and some further analysis to create a final output. The desired final output depends on the application.
Different methods for document layout analysis exist. Some methods partition the page into fixed-size blocks to give a coarse classification of the page. Such methods, however, can only give a single classification to a region, which applies to the pixels of all colours within that region. For example, a region may be classified as containing text, but that classification does not distinguish the pixels which are part of the text from the pixels in the background. In most such systems, analysis is done in terms of a binary image, so it is clear that the text is one colour and the background another. In such cases, classifications of ‘text’ and ‘inverted text’ are sufficient to distinguish which is which. However, in a complicated multi-colour document, a single region may contain text of multiple colours, perhaps over backgrounds of multiple colours, including even natural images. In such cases, a binary image that sufficiently represents the text in the document cannot be generated without first analysing the document to determine where the text resides in different areas, which is itself the problem the system is trying to solve. A coarse region-based classification is therefore not sufficient to represent the document content.
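The limitation of a single label per fixed-size block can be illustrated with a minimal sketch. The function name classify_blocks, the greyscale input, the block size and the dark-fraction rule below are all assumptions made for illustration only and do not correspond to any particular published method.

```python
def classify_blocks(image, block=2, dark=128):
    """Assign one label per fixed-size block of a greyscale image.

    Hypothetical toy rule: a block with no dark pixels (or only dark
    pixels) is 'background'; a block that is mostly light with some dark
    pixels is 'text'; otherwise it is 'inverted text'.
    """
    height, width = len(image), len(image[0])
    labels = []
    for top in range(0, height, block):
        row_labels = []
        for left in range(0, width, block):
            # Collect the pixels of this block, clipping at the image edge.
            pixels = [
                image[y][x]
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            ]
            dark_fraction = sum(p < dark for p in pixels) / len(pixels)
            if dark_fraction in (0.0, 1.0):
                label = "background"
            elif dark_fraction < 0.5:
                label = "text"
            else:
                label = "inverted text"
            row_labels.append(label)
        labels.append(row_labels)
    return labels


# A 2x4 greyscale page: one dark pixel in the left block, none in the right.
page = [
    [255, 0, 255, 255],
    [255, 255, 255, 255],
]
block_labels = classify_blocks(page)
```

In this sketch the left block receives the label 'text', yet nothing in the output records which of its four pixels is the dark one. That is the loss of pixel-level detail described above.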
Other methods of document layout analysis use the background structure. Again, however, this is generally done on black and white images and does not extend easily to complicated colour documents.
There is therefore a need for methods which provide a pixel-level classification in a complicated colour document. Some such methods do exist; however, in providing an analysis at a pixel level, they generally lack context from the rest of the page, which may be helpful to the classification. Many such methods also require a large number of operations to be applied to each pixel. For an application of document analysis embedded in a scanner, such methods may be too slow when running with the limited computational resources available inside most document scanners.
It is therefore desirable for document analysis to afford efficiency in an environment with low resources, and offer a pixel level of detail in classification. Further, it is desirable to make use of context over a large area for these classifications, and to perform well on colour documents with complicated layouts.
Some document layout analysis applications such as OCR, automatic form processing and other scan-to-high-level document processes require a segmentation step to decompose a document image from its raw pixel representation into a more structured format prior to the actual page layout analysis. This segmentation step dominates the overall speed and accuracy of these applications. Many existing applications employ a black and white segmentation, in which the document image is binarised into a binary image, consisting of black and white pixels. Regions are then formed by connected groups of black or white pixels. While binarisation is an efficient technique, it suffers from an inability to distinguish and isolate adjoining colours of similar luminance. Furthermore, it throws away much useful colour information during the process. For complex documents with multi-colour foreground and background objects, binarisation is clearly inadequate. Thus, a colour segmentation method is required.
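The inability of binarisation to separate adjoining colours of similar luminance can be shown with a minimal sketch. The luma weights are the commonly used Rec. 601 coefficients. The function names, the threshold of 128 and the example colours are assumptions chosen for illustration.

```python
def luminance(rgb):
    """Approximate luma of an (R, G, B) pixel using Rec. 601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def binarise(image, threshold=128):
    """Map each pixel to 1 (white) or 0 (black) by thresholding its luma."""
    return [
        [1 if luminance(pixel) >= threshold else 0 for pixel in row]
        for row in image
    ]


RED = (255, 0, 0)       # luma of roughly 76
GREY = (76, 76, 76)     # luma of 76, very close to that of RED
WHITE = (255, 255, 255)

# A red vertical stroke on a dark grey background.
stroke_on_grey = [
    [GREY, RED, GREY],
    [GREY, RED, GREY],
]
binary_stroke = binarise(stroke_on_grey)

# The same red next to white is separated without difficulty.
binary_contrast = binarise([[WHITE, RED]])
```

Under these assumptions every pixel of the stroke image falls below the threshold, so the red stroke and the grey background merge into a single black region, whereas the red and white pixels are separated. No choice of a single global threshold separates the red from the grey reliably, because their luma values are almost identical.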
Effective colour segmentation is a challenging problem. This is especially true for scanned images, in which scanning and printing artefacts can pose serious problems for identifying perceptually uniform colour regions. A perceptually uniform colour region in a digital image is a group of connected pixels (or pixels in close proximity) that a human observer interprets as semantically related. For example, the pixels that make up an alphabetic character in a document image appear to be the same colour to the reader. However, on closer inspection, the number of colours is usually far higher because of printing and scanning artefacts such as halftone, bleeding, and noise. The challenge is to satisfy two competing requirements: remaining stable to local colour fluctuations due to noise in what would otherwise be a unitary colour structure in the source image, whilst remaining sensitive to genuine changes, such as a change from a white background to another light-coloured background, or smaller text on a background of non-constant colour.
Page decomposition is a form of colour segmentation that is specifically targeted at document image analysis. In addition to colour information, it uses knowledge of document layout structure, and text and non-text characteristics extensively to aid the segmentation process.
There are two main approaches to full colour page decomposition: bottom-up and top-down. The bottom-up approach examines each pixel in turn and groups adjoining pixels of similar colour values to form connected components. This method has the advantage of being efficient; however, it is highly sensitive to noise and colour fluctuations because of its lack of context information. Thus, it tends to produce a large number of erroneous connected components, resulting in fragmentation. In contrast, the top-down approach partitions a page into non-overlapping blocks. Each block is analysed and given a label using local features and statistics extracted at its full resolution or at a reduced resolution. This approach only provides a coarse segmentation into regions of interest, e.g., Block A is likely to be text and Block B is likely to be line-art. Pixel-level segmentation can be achieved by further processing these regions of interest, but an additional pass is both slow in software implementations and expensive in hardware implementations. With a single label per block, this approach is unsuitable for complex images where regions may consist of a number of different document content types, e.g., text over line-art.
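The bottom-up grouping described above, and its tendency to fragment under noise, can be sketched as follows. This is a minimal illustration only: the 4-connectivity, the per-channel tolerance, and the names similar and connected_components are assumptions, not details taken from any specific system.

```python
from collections import deque


def similar(a, b, tol):
    """True if two (R, G, B) pixels differ by at most tol in every channel."""
    return max(abs(x - y) for x, y in zip(a, b)) <= tol


def connected_components(image, tol=8):
    """Label 4-connected groups of adjoining pixels of similar colour.

    Returns (labels, count), where labels has the same shape as image.
    """
    height, width = len(image), len(image[0])
    labels = [[-1] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if labels[y][x] != -1:
                continue  # already assigned to a component
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and labels[ny][nx] == -1
                            and similar(image[cy][cx], image[ny][nx], tol)):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
            count += 1
    return labels, count


DARK = (10, 10, 10)
clean_row = [[DARK, DARK, DARK, DARK, DARK]]
# The same stroke with slight colour variation and one bright noise pixel.
noisy_row = [[DARK, (12, 12, 12), (200, 200, 200), (11, 11, 11), DARK]]

_, clean_count = connected_components(clean_row)
noisy_labels, noisy_count = connected_components(noisy_row)
```

With these inputs the clean row forms a single connected component, while the single noise pixel splits the noisy row into three. Because each decision uses only a pixel and its immediate neighbour, the procedure has no wider context with which to recognise the bright pixel as noise.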
There is a need for a colour segmentation technique that can decompose a document image into document object representations which represent complex document contents with pixel-level accuracy, and that at the same time takes into account local context information which can be used to distinguish genuine changes from noise.