The extraction of information elements from a black-and-white document image, followed by automatic layout analysis, is known, for example, from European Patent Publication No. EP 0 629 078 B. Other methods are also known from the literature, several of which are mentioned in the introduction of the '078 patent publication.
The known methods are usually carried out on a digital image formed, for example, by scanning a document with an electro-optical scanner. Groups of contiguous pixels of the same colour (“connected components”) are divided into information-bearing (foreground) groups and background groups, and the information-bearing groups are classified into types such as characters, lines and photographs. The information-bearing pixel groups, or a selection thereof corresponding to a limited set of types, can then be extracted for further interpretation processing.
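The grouping of contiguous pixels of the same colour can be illustrated with a minimal sketch. It assumes a 4-connected neighbourhood and a binary image held as a list of rows; the function name `label_components` and the sample `page` are hypothetical and are not taken from the cited publications.

```python
from collections import deque


def label_components(image):
    """Label 4-connected groups of equal-valued pixels.

    `image` is a list of rows of pixel values. Returns (labels, count):
    `labels` has the same shape as `image` and holds a group number per
    pixel, `count` is the number of groups found.
    """
    height, width = len(image), len(image[0])
    labels = [[-1] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if labels[r][c] != -1:
                continue  # pixel already belongs to a group
            value = image[r][c]
            labels[r][c] = count
            queue = deque([(r, c)])
            # Breadth-first flood fill over neighbours of the same value.
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and labels[ny][nx] == -1
                            and image[ny][nx] == value):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
            count += 1
    return labels, count


# Assumed toy image: 1 = black (information-bearing), 0 = white (background).
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
labels, count = label_components(page)
```

In this toy image the white pixels form one background group and each black stroke forms its own foreground group. Classifying the groups into types such as characters or lines is a separate step that the sketch does not attempt.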
Such methods are based on a binary image in black and white, i.e. an image of binary pixels. Such pixels have only two possible values: on or off, 0 or 1, white or black. One value, for example black, is considered information-bearing, while the other, i.e. white, is considered non-information-bearing or background. These methods cannot be applied directly to colour-containing digital images, because such images contain pixels of different colours which cannot immediately be divided into the two classes of “information-bearing” and “background”. Both the information and the background can in fact be coloured, and it is not known a priori which colour has which function.
In addition, a colour image frequently contains errors, i.e. small areas with a deviant colour, as a result of the limited resolution of the scanner, noise and/or registration errors in the printing process of the scanned colour document. This manifests itself, for example, in pixels with a transition colour along the edges of characters. An example is given in FIG. 1, which shows a detail of a scanned document image in which the pixels that have a wrong colour as a result of scanner errors are shown shaded.
Areas having a wrong colour give rise to problems because they disturb the interpretation process.
Coloured images often contain very many different colours. This also gives rise to problems in extraction processes, because all the colours occurring in the image must be divided up separately into information-bearing or background. It is therefore advantageous first to quantise the set of colours occurring in a document into a limited number of colour groups. Techniques for quantising colours are described in the literature, for example in Sobottka, K. et al.: “Identification of text on colored book and journal covers”, Fifth International Conference on Document Analysis and Recognition, September 1999, pp. 57–62, and in commonly-assigned Netherlands patent application No. 1013669. In both documents the colour quantisation is carried out as a preparation for an interpretation process. According to these methods, the colours occurring in a digital image are grouped into a limited number of clusters and all the colours lying in a certain cluster are characterised by a colour code for that cluster. Locally there is then usually just a very small number of different colour codes left, so that a distinction between information elements and the background becomes much simpler.
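The replacement of many colours by a few colour codes can be sketched as follows. The sketch assumes 8-bit RGB pixels and uses uniform binning of each channel as a simple stand-in for clustering; it is not the quantisation method of either cited document, and the function name `quantise` and the sample pixel values are hypothetical.

```python
def quantise(image, levels=2):
    """Replace each (r, g, b) pixel by a colour code.

    Assumption: each 0-255 channel is cut into `levels` equal bins, and
    every distinct bin triple that occurs receives its own code, numbered
    in order of first appearance. Returns (codes, number_of_codes).
    """
    step = 256 / levels
    code_of_bin = {}
    codes = []
    for row in image:
        code_row = []
        for pixel in row:
            bin_key = tuple(min(int(ch / step), levels - 1) for ch in pixel)
            # Reuse the code of a known bin, otherwise allocate the next one.
            code = code_of_bin.setdefault(bin_key, len(code_of_bin))
            code_row.append(code)
        codes.append(code_row)
    return codes, len(code_of_bin)


# Assumed toy data: six distinct colours, near-white paper and dark ink,
# plus one greyish transition pixel at the end of the second row.
image = [
    [(250, 250, 245), (20, 10, 30), (255, 255, 255)],
    [(240, 248, 251), (35, 25, 15), (120, 110, 100)],
]
codes, number_of_codes = quantise(image)
```

With two levels per channel the six colours collapse into two codes. The greyish transition pixel happens to fall into the same bin as the ink here; with finer bins or a different clustering it could equally receive a code of its own.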
However, this does not solve the problem of wrongly coloured areas along the edges of information elements, because these areas may be given different colour codes during quantisation, particularly if image elements having (practically) the same colour as the said “wrong” colour occur at other places in the image. Moreover, the quantisation may itself give rise to a situation in which information elements are split up into components with different colour codes, so that an information element of this kind becomes completely unrecognisable as an entity for a further processing operation.
The mentioned Netherlands patent application proposes an after-treatment of the image subjected to colour quantisation, said after-treatment including establishing character contours by chain-coding. In this case a contour is constructed as a separation between the pixels having a colour code deviating from that of the surrounding background and the pixels with the colour code of the background. The further processing is then effected on the contours, without further looking at the original colour codes.
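The separation between background-coded pixels and all other pixels can be illustrated with a simplified sketch. It does not perform chain-coding; it only collects the non-background pixels that touch the background, assuming a 4-connected neighbourhood and treating positions outside the image as background. The function name `contour_pixels` and the sample grid are hypothetical.

```python
def contour_pixels(codes, background):
    """Return the (row, col) positions of non-background pixels that have
    at least one 4-neighbour with the background code.

    Assumption: positions outside the image count as background.
    """
    height, width = len(codes), len(codes[0])
    contour = set()
    for r in range(height):
        for c in range(width):
            if codes[r][c] == background:
                continue
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nr < height and 0 <= nc < width
                if not inside or codes[nr][nc] == background:
                    contour.add((r, c))
                    break
    return contour


# Assumed toy grid: background code 0, a 3x3 block made of codes 1 and 2.
codes = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 2, 0],
    [0, 1, 1, 2, 0],
    [0, 1, 1, 2, 0],
    [0, 0, 0, 0, 0],
]
contour = contour_pixels(codes, background=0)
```

The pixels with code 2 end up on the contour in exactly the same way as those with code 1: once the separation from the background has been made, the original colour codes play no further role.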
The disadvantage of this after-treatment method is that all the pixels deviating from the background colour are regarded as belonging to the information element or character, even if they actually belong to the background. Referring again to FIG. 1, this known method will extract the two digits as a single entity, and as a result errors can occur in an OCR process.
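The effect can be reproduced on a toy grid, which is an assumed stand-in and not the data of FIG. 1. Two strokes with colour code 1 are bridged by a single pixel with a deviant code 2; the sketch counts 4-connected groups once with every non-background pixel treated as foreground and once, purely for contrast, with only code 1 treated as foreground. The function name `count_groups` is hypothetical.

```python
def count_groups(mask):
    """Count 4-connected groups of True cells in a list-of-rows mask."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    groups = 0
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            groups += 1
            seen[r][c] = True
            stack = [(r, c)]
            # Depth-first flood fill over neighbouring True cells.
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return groups


# Assumed codes: 0 = background, 1 = character colour, 2 = deviant colour.
codes = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 2, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
non_background = [[code != 0 for code in row] for row in codes]
character_colour_only = [[code == 1 for code in row] for row in codes]

merged = count_groups(non_background)           # strokes joined by the bridge
separate = count_groups(character_colour_only)  # strokes kept apart
```

Treating everything that deviates from the background as foreground yields one group, so the two strokes would be handed on as one element. Restricting the mask to code 1 keeps them apart in this example, but it is shown only to make the contrast visible and is not proposed as a general remedy.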
It should be noted that colour quantisation is only necessary if the image for processing contains many colours. If that is not the case, quantisation may be superfluous.