The invention relates to a method for compressing scanned, colored and gray-scale documents, the digital image of the scanned document being divided into three image planes. These image planes are a foreground image, a background image and a binary mask image. The mask image describes which areas of the document belong to the foreground and which to the background. Components of the foreground are text and graphic elements. The color and intensity of these foreground areas are described by the foreground image. The background areas include the text background as well as the pictures contained in the document. The color or brightness information of the background as well as the images contained in the document are contained in the background image. Each of the three images is coded separately with a suitable image coding method. For the decoding, the document is assembled once again from the foreground and background images. In this connection, the binary mask image describes the areas, in which the reconstructed document is to be generated from the foreground image or from the background image.
In the case of a suitable division of the document into the three image levels described, clearly better compression results can be obtained with this representation than with image coding methods which code the document as a whole. The invention describes a new method for determining the binary mask image as well as a method for the efficient division of the original document into the foreground and background images. No special assumptions are made regarding the nature or construction of the document.
Documents, which are scanned with high resolutions, require much memory for the resulting digital image data. For example, a 300 dpi scan of a colored A4 page takes up approximately 25,000,000 bytes and a colored 600 dpi scan takes up 100,000,000 bytes. Documents with amounts of data of this order of magnitude can be archived uncompressed only in small amounts. Transfer over networks with low transfer rates is practically impossible.
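The stated orders of magnitude can be checked with a short calculation. The following is a minimal sketch, assuming an A4 page of 210 mm by 297 mm and three bytes per pixel (24-bit color); the function name `raw_scan_bytes` is chosen here for illustration only.

```python
MM_PER_INCH = 25.4

def raw_scan_bytes(dpi, width_mm=210.0, height_mm=297.0, bytes_per_pixel=3):
    """Uncompressed size of a scan; A4 dimensions and 3 bytes per pixel are assumptions."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bytes_per_pixel

size_300 = raw_scan_bytes(300)  # on the order of 25 million bytes
size_600 = raw_scan_bytes(600)  # about four times as much
print(size_300, size_600)
```

Doubling the resolution doubles the pixel count in both directions, which is why the amount of data grows by a factor of about four.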
Lossless compression methods, such as the lossless mode of the JPEG standard (JPEG-LS) or Lempel-Ziv-Welch (LZW), make only very small compression factors possible. Higher compression factors are possible only through the use of lossy compression methods. The DCT-based “JPEG method” of the Joint Photographic Experts Group is considered to be the standard method. However, neither the JPEG method nor the newer, better wavelet-based compression methods can be used for the high-grade compression of scanned documents. These strictly image compression methods presume the statistics of typical image signals, which are characterized by a high local correlation. Since these assumptions do not apply to scanned documents, the text portion of the documents is greatly changed at high compression factors, so that it becomes impossible to read the text. At the present time, documents for archiving are usually scanned in the binary mode and compressed with the CCITT fax compression standard “Fax Group 3” or “Fax Group 4”. In general, the readability is retained by these strictly binary compression methods. However, the brightness and color information of the image portions is lost entirely.
The Mixed Raster Content standard (MRC) (ITU recommendation T.44), which is presently in the planning stage, is a new attempt to avoid these problems. According to this standard, it is possible to divide a document into regions of different local resolution, which can be coded in different ways. One mode of the MRC standard is a multi-layer coding mode, which provides for a division of the document into the three previously described planes. However, in the MRC standard, only the decoding process is fixed unambiguously. The method of dividing the document into the three image planes during the coding is not specified. A method, which uses this multi-layer coding mode, is described in U.S. Pat. No. 5,779,092. However, the method specifies conditions, which are not applicable to many of the documents, which are to be processed, namely that the shape of the images is assumed to be rectangular. No provisions are made for detecting text with background images or for the presence of text within images. Furthermore, the document background basically must be white or bright.
It is an object of the invention to make possible the compression of scanned documents without being restricted by the nature of the original copies, such as a bright background, rectangular illustrations and precise separation of image and text components. Moreover, the expenditure of computing time and the use of memory shall be reduced clearly. Furthermore, different classes of documents and images shall be compressed according to a uniform method by way of a few control parameters.
A new method for compressing scanned documents is introduced. Starting out from the representation of the document in three planes (foreground image, background image and mask image), the basic course of the compression method is presented in the following:
To begin with, a locally variable threshold value image is generated with an adaptive threshold method from the original document, which has been reduced in a defined manner, and is brought back once again to the size of the original document. The original image is subsequently quantized with this threshold value image in order to produce a bitonal quantization image. By means of this quantization image and the original image, a text detection (segmenting), which divides the document into foreground and background regions, is subsequently carried out. In so doing, regions such as text and graphic elements are assigned to the foreground image, and the text background and the images to the background image. The result of this segmenting is filed in the binary mask image. The binary mask image has the same size and resolution as the original image. Subsequently, a foreground image, which describes the color of the foreground regions, is produced from the original image and the mask image. Compared to the original image, this foreground image has a reduced resolution. After that, the background image is produced, once again with a reduced resolution, from the complement of the bitonal mask image and the original image. Subsequently, the three images are coded in each case with a suitable image coder.
The quantization or binarizing of the original document takes place in two steps. To begin with, a locally variable threshold value is determined with an adaptive method. The comparison of the gray value representation of the original document with this threshold value supplies the binary quantizing image.
If the original document exists as a colored document, it is converted, to begin with, into a gray scale image and two color difference component images. The gray scale image is used for determining the locally variable threshold value image. The gray scale image, which normally exists with a resolution of 150 to 600 dpi, is reduced to a suitable size. For this purpose, a low-pass filtering with subsequent sub-sampling is carried out. Local distortions in the original scan, such as noise or dither and raster effects are decreased by the reduction.
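The conversion and reduction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the ITU-R BT.601 luma weights and the box filter used as low-pass are choices made here, since the text names neither a color space nor a specific filter; the function names are hypothetical.

```python
def to_luma_chroma(rgb):
    """Split one (r, g, b) pixel into a gray value and two color differences.

    The BT.601 weights are an assumption; the text does not name a color space.
    """
    r, g, b = rgb
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, b - y, r - y

def reduce_gray(gray, factor):
    """Box-filter low-pass and sub-sampling in one step.

    Each output pixel is the mean of one factor x factor block;
    incomplete blocks at the right and bottom edges are dropped.
    """
    h, w = len(gray), len(gray[0])
    out = []
    for by in range(0, h - factor + 1, factor):
        row = []
        for bx in range(0, w - factor + 1, factor):
            block = [gray[y][x]
                     for y in range(by, by + factor)
                     for x in range(bx, bx + factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out
```

Averaging over a block before sub-sampling is what attenuates noise, dither and raster patterns in the reduced image.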
A local dynamics analysis is carried out next. For this purpose, the reduced gray scale image is subjected to a minimum filtration and a maximum filtration. The difference between the maximum image and the minimum image produces a dynamics image, which supplies indications regarding regions with strong edges, such as text regions.
In the next step, the dynamics image is compared with a minimum dynamic, which can be specified externally and controls the sensitivity of the quantizing and of the text detection. In regions, the dynamics of which exceed this minimum value, a quantizing threshold value is determined from half the sum of the minimum image and the maximum image. In regions, the dynamics of which are too low, the quantizing threshold value initially is set equal to zero.
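The dynamics analysis and the initial threshold assignment can be sketched as follows. The square filter window and its radius are assumptions, as the text does not specify the filter size; `min_dynamics` stands for the externally specified minimum dynamic.

```python
def min_max_filter(img, radius=1):
    """Minimum and maximum over a square window (window size is an assumption)."""
    h, w = len(img), len(img[0])
    lo = [[0] * w for _ in range(h)]
    hi = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            lo[y][x] = min(window)
            hi[y][x] = max(window)
    return lo, hi

def initial_thresholds(img, min_dynamics, radius=1):
    """Threshold is (min + max) / 2 where the dynamics exceed min_dynamics, else 0."""
    lo, hi = min_max_filter(img, radius)
    h, w = len(img), len(img[0])
    thr = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if hi[y][x] - lo[y][x] > min_dynamics:
                thr[y][x] = (lo[y][x] + hi[y][x]) / 2
    return thr
```

A larger `min_dynamics` leaves more pixels at zero, so that fewer weak edges influence the later quantization.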
Since the threshold values, so determined, are subject to strong local fluctuations and their maxima in each case are to be found in the brighter region of an edge, all threshold values, not equal to zero, are averaged as a next step. The averaged threshold values are now extended to the adjoining pixels.
In the last step, all remaining pixels, for which there is not yet a threshold value, are filled in with a value, which is formed from the average of all values not equal to zero. The newly calculated values are written back directly in the threshold value image. As a result, it is possible to determine a threshold value for all pixels of the image with only one pass.
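The effect of writing the values back directly can be illustrated with a short sketch of the filling step. The 3 x 3 neighborhood and the fallback to the global average for pixels without any determined neighbor are assumptions made here so that a single raster pass always suffices; the text does not fix these details.

```python
def fill_in_place(thr):
    """Fill zero entries in raster order, writing each result back immediately."""
    h, w = len(thr), len(thr[0])
    nonzero = [v for row in thr for v in row if v != 0]
    if not nonzero:
        return thr  # no threshold was determined anywhere; nothing to propagate
    # Assumption: use the global mean when no neighbor has a value yet.
    fallback = sum(nonzero) / len(nonzero)
    for y in range(h):
        for x in range(w):
            if thr[y][x] != 0:
                continue
            near = [thr[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))
                    if thr[j][i] != 0]
            thr[y][x] = sum(near) / len(near) if near else fallback
    return thr
```

Because each filled value is immediately visible to the pixels that follow, the values spread across the image during the same pass instead of requiring repeated iterations.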
Since the threshold value image, now determined, is smaller than the original document, it must be brought back to the original size for the quantizing. Enlargement takes place by means of bilinear interpolation. As a result, during the quantizing subsequently carried out, there are fewer disturbances than in the case of a simple enlargement by pixel repetition. This leads to better coding results for the binary mask image, produced in the following.
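A minimal sketch of the enlargement and of the subsequent quantizing is given below. The pixel-center alignment, the clamping at the image border and the convention that 1 marks pixels darker than the threshold are assumptions made for illustration.

```python
def enlarge_bilinear(small, factor):
    """Enlarge an image by an integer factor with bilinear interpolation."""
    h, w = len(small), len(small[0])
    out = []
    for big_y in range(h * factor):
        # Map the center of the output pixel back into the small image and clamp.
        fy = min(max((big_y + 0.5) / factor - 0.5, 0.0), h - 1.0)
        y0 = int(fy)
        y1 = min(y0 + 1, h - 1)
        ty = fy - y0
        row = []
        for big_x in range(w * factor):
            fx = min(max((big_x + 0.5) / factor - 0.5, 0.0), w - 1.0)
            x0 = int(fx)
            x1 = min(x0 + 1, w - 1)
            tx = fx - x0
            top = small[y0][x0] * (1 - tx) + small[y0][x1] * tx
            bottom = small[y1][x0] * (1 - tx) + small[y1][x1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        out.append(row)
    return out

def quantize(gray, thr):
    """Bitonal image: 1 where the gray value lies below the local threshold (assumed convention)."""
    return [[1 if g < t else 0 for g, t in zip(gray_row, thr_row)]
            for gray_row, thr_row in zip(gray, thr)]
```

With pixel repetition the threshold would jump at block borders; the interpolated threshold changes gradually, which avoids block-shaped steps in the quantization image.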
The binary quantizing image produced is the starting point for the second area of the inventive method, the text detection. It is an object of the text detection (segmenting) to produce the binary mask image, which describes a pixel-by-pixel assignment to foreground and background regions. Text and graphic structures, which were detected as belonging to the foreground, are represented in black in the binary mask image, whereas background regions are represented in white.
Pursuant to the invention, the segmentation treats all connected regions of the same value in the quantization image as possible candidates for foreground components. For this purpose, their affiliation with the foreground is investigated according to diverse criteria. The regions, segmented as foreground, are entered in the binary mask image. During the segmenting, the mask image initially is not yet binary and can assume the states of “not yet investigated”, “foreground”, “background” and “hole”.
As a first step, all connected regions of the quantized image are identified. For this purpose, a four-fold neighborhood is used for documents with a high resolution and an eight-fold neighborhood for documents with low resolutions.
When the regions are determined, a size filtration is carried out. Depending on the resolution of the document, there is a minimum size and a maximum size, the limits of which must not be exceeded. Regions, which lie in the permissible range, are investigated further. All other regions are discarded.
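The identification of connected regions and the size filtration can be sketched as follows. The breadth-first flood fill is one common way of finding connected regions and is used here as an assumption; the text does not prescribe a particular labeling algorithm, and the size is measured here simply as the number of pixels.

```python
from collections import deque

def connected_regions(img, eight=False):
    """Return all connected regions of equal value as (value, [(y, x), ...])."""
    h, w = len(img), len(img[0])
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # four-fold neighborhood
    if eight:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]  # add the diagonals
    seen = [[False] * w for _ in range(h)]
    regions = []
    for sy in range(h):
        for sx in range(w):
            if seen[sy][sx]:
                continue
            value = img[sy][sx]
            seen[sy][sx] = True
            queue, pixels = deque([(sy, sx)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in steps:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and img[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((value, pixels))
    return regions

def size_filter(regions, min_size, max_size):
    """Keep regions whose pixel count lies within the permissible range."""
    return [r for r in regions if min_size <= len(r[1]) <= max_size]
```

On a checkerboard of two by two pixels, the four-fold neighborhood yields four separate regions, whereas the eight-fold neighborhood joins the diagonal pixels into two regions.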
The next step is a border detection. For this purpose, all edge pixels of the region are investigated with the help of known edge filters, the edge activity being determined by the absolute value of the filter response.
In order to be able to differentiate raster effects from noise more effectively, the edge activity determined is compared with a minimum activity value. If the edge activity is less than this minimum value, it is set equal to zero.
In a further step, the average edge activity and the maximum edge activity are now determined for the border of the region and, in addition, the variance is determined for the interior of the region.
The next step is to check whether the average edge activity as well as the maximum edge activity lie above the specified minimum values and, in addition, whether the variance of the inner region does not exceed a maximum value.
In the event of a positive result, the actual region is classified as a foreground region and entered as such in the mask image. In the event that this region touches a different region, which has already been classified as foreground, it is entered as a “hole”. If the test is negative, it is a background region, which is also entered into the mask image.
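The test applied to a single region can be sketched as follows. The central-difference gradient stands in for the "known edge filters" and is an assumption, as are all parameter names; the handling of holes is omitted from this sketch.

```python
def edge_activity(gray, y, x, min_activity):
    """Absolute response of a simple gradient filter (an assumed edge filter).

    Responses below min_activity are set to zero, as described in the text.
    """
    h, w = len(gray), len(gray[0])
    gx = gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)]
    gy = gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x]
    activity = abs(gx) + abs(gy)
    return activity if activity >= min_activity else 0

def is_foreground(gray, border, interior, min_activity,
                  min_mean, min_max, max_variance):
    """Check mean and maximum border activity and the variance of the interior."""
    acts = [edge_activity(gray, y, x, min_activity) for y, x in border]
    mean_act = sum(acts) / len(acts)
    values = [gray[y][x] for y, x in interior] or [0]  # tiny regions have no interior
    mean_val = sum(values) / len(values)
    variance = sum((v - mean_val) ** 2 for v in values) / len(values)
    return mean_act > min_mean and max(acts) > min_max and variance <= max_variance
```

A dark, uniformly colored character on a bright background passes all three checks, whereas a region inside a photograph typically fails because its border is weak or its interior varies too much.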
After the classification, the mask image is binarized. For this purpose, the foreground regions are set at black and all remaining regions at white.
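The four states of the mask image during segmenting and the final binarizing can be sketched as follows. The enumeration names and the representation of black as 1 are assumptions made for illustration.

```python
from enum import Enum

class MaskState(Enum):
    """States of a mask pixel during segmenting (names are illustrative)."""
    UNKNOWN = 0      # not yet investigated
    FOREGROUND = 1
    BACKGROUND = 2
    HOLE = 3

def binarize(mask):
    """Foreground becomes black (1); every other state becomes white (0)."""
    return [[1 if state is MaskState.FOREGROUND else 0 for state in row]
            for row in mask]
```

Holes and pixels that were never investigated are thus treated exactly like background in the final mask.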
The segmenting can be carried out in four different modes. The previously described procedure corresponds to a first mode, which is capable of detecting normal and inverse text. For many documents, such as simple letters, in which there is no inverse text, a second mode can be used, which investigates only the black connected regions of the quantized image. A third mode is meaningful for documents, which are extremely difficult to segment, such as maps. All black regions of the quantizing image are automatically assumed to be foreground regions here. By these means, satisfactory coding results can be achieved even with documents, which are difficult to segment. The fourth mode consists of classifying all pixels directly as background pixels; this is meaningful when coding strictly image documents.
In the following, the foreground and background images are determined. For this purpose, two aspects must be fulfilled. On the one hand, the foreground image and the background image should represent the intensities of the original scan as well as possible, so that visual artifacts do not result even if the segmenting has been defective. On the other hand, the structure of the resulting images must be as simple as possible, in order to ensure efficient coding.
The reduced foreground image is produced by means of the binary mask image and the original image.
Initially, all regions, belonging to the foreground, are identified in the binary mask image. Since the border values of the regions have defective intensities or colors, the foreground regions are thinned basically by one pixel. However, at least the skeleton of the region is retained.
The original image and the thinned binary mask image are divided into blocks. The edge length of the blocks corresponds here to the reduction factor. If a block contains a portion of the thinned foreground region, the average value of the corresponding pixels of the original document is formed over this portion and written at the corresponding place in the reduced foreground image. If a block does not contain a thinned foreground region, the value of the foreground image is set equal to zero.
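The block-by-block averaging can be sketched as follows for a single image component. The sketch assumes that the mask passed in has already been thinned and that a mask value of 1 marks the foreground; the function name is hypothetical.

```python
def reduce_with_mask(gray, mask, factor):
    """Average, per factor x factor block, only the pixels selected by the mask.

    Blocks without any selected pixel are set to zero.
    """
    h, w = len(gray), len(gray[0])
    out = []
    for by in range(0, h, factor):
        row = []
        for bx in range(0, w, factor):
            picked = [gray[y][x]
                      for y in range(by, min(by + factor, h))
                      for x in range(bx, min(bx + factor, w))
                      if mask[y][x]]
            row.append(sum(picked) / len(picked) if picked else 0)
        out.append(row)
    return out
```

Because only masked pixels enter the average, the color of the paper surrounding a character does not bleed into the value stored for that character.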
After that, the foreground pixels with values different from zero are expanded with an average value filtration for adjoining pixels.
In the last step, all remaining pixels are filled in. In so doing, the average value of all values not equal to zero is formed. To increase the coding efficiency, a constant gray value is included in the formation of the average value as an additional portion, in order to dampen the foreground image to gray in regions without foreground regions. Newly calculated values are written back directly in the foreground image, as a result of which it is possible to determine all pixels of the image with only one pass.
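The damping toward gray can be illustrated for a single pixel to be filled. The gray value of 128 and the weight with which it enters the average are assumptions; the text states only that a constant gray value is included as an additional portion.

```python
def damped_fill_value(neighbors, gray=128, gray_weight=1):
    """Average of the non-zero neighbors with a constant gray value mixed in.

    gray and gray_weight are illustrative assumptions.
    """
    values = [v for v in neighbors if v != 0]
    return (sum(values) + gray * gray_weight) / (len(values) + gray_weight)
```

With each filling step farther away from a foreground region, the gray portion gains influence, so that large areas without foreground converge to a uniform gray that is cheap to code.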
The determination of the background image corresponds, in principle, to the method, which is used for determining the foreground image, with the exception that the processing is carried out with the complement, the inverse, binary mask image.
Next, all regions belonging to the background are identified in the binary mask image. Since the border values of the regions have defective intensities or colors, the background regions are also thinned by one pixel. However, at least the skeleton of the region is retained.
The original image and the thinned, inverse, binary mask image are once again divided into blocks, the edge length of which corresponds now to the reduction factor for the background image. If a block contains a portion of the thinned background region, the average value of the corresponding pixels of the original document is formed over this portion and written into the reduced background image. If a block does not contain any thinned background regions, the corresponding pixel of the background image is set equal to zero.
In the next step, the background pixels with values not equal to zero are expanded with an average value filtration for adjoining pixels.
In the last step, all remaining pixels are filled in. The average value over all values, not equal to zero, is formed in each case here. Newly calculated values are written back directly. As a result, it is possible to determine all pixels of the background image with only one pass.
When all the three images, foreground image, background image and the binary mask image, have been produced according to the inventive method, they are compressed with known image coders.
The decoding consists of decoding the foreground image, the background image and the binary mask image initially separately. The reduced background images and foreground images are enlarged by interpolation to the size of the mask image. Depending on the value of the binary mask image, the reconstructed image is then assembled from the pixels of the enlarged background image and foreground images.
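The final assembly can be sketched as follows. The sketch assumes that the foreground and background images have already been decoded and enlarged to the size of the mask image, and that a mask value of 1 selects the foreground.

```python
def compose(mask, foreground, background):
    """Take each pixel from the foreground where the mask is 1, else from the background."""
    return [[f if m else b for m, f, b in zip(mask_row, fg_row, bg_row)]
            for mask_row, fg_row, bg_row in zip(mask, foreground, background)]
```

Since the mask is kept at full resolution, the edges of the text remain sharp in the reconstructed document even though both color images are stored at a reduced resolution.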