Digital documents are generated every time a printed page or film is received by a facsimile machine, scanner, digital photocopier, or other similar digital input device. These digital documents are composed of an array of pixels with values representing gray scale. Generally, these digital documents contain different types of image information, such as text having different background and foreground gray scale values, continuous tone images, graphics, and halftone images, which may be mixed on a document page.
Conventional facsimile machines operate on the data of digital documents to provide a representation of the document suitable for transmission and subsequent rendition by a receiving facsimile machine. These operations are often referred to as rendition methods and include, for example, ordered dithering, error diffusion, and binarization (bi-level quantization). By applying these rendition methods, a bit map representation of the document is formed. Typically, facsimile machines operate on digital documents by applying a single rendition method to the entire image. This fails to adequately reproduce documents having mixed types of image information because not all image types can properly be reproduced by the same rendition method. For example, binarizing may be proper for text images, but when applied to continuous tone images, gray scale transitions of the image are lost. Further, applying ordered dithering or error diffusion can halftone a continuous tone image, but applying such methods to text images causes the edges of text to blur, which sometimes results in text being illegible. Thus, applying an improper rendition method to different image components of a document produces distortions which degrade reproduction quality.
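The difference between binarization and error diffusion described above can be illustrated with a minimal sketch. The function names, the threshold of 128, and the use of the Floyd-Steinberg weights are assumptions made for illustration only; the text does not name a particular error diffusion kernel.

```python
def binarize(image, threshold=128):
    """Bi-level quantization: each pixel becomes 0 (black) or 255 (white)."""
    return [[255 if p >= threshold else 0 for p in row] for row in image]


def error_diffuse(image, threshold=128):
    """Error diffusion with Floyd-Steinberg weights (an assumed kernel)."""
    h, w = len(image), len(image[0])
    work = [[float(p) for p in row] for row in image]  # running gray values
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            old = work[y][x]
            new = 255 if old >= threshold else 0
            out[y][x] = new
            err = old - new  # quantization error pushed to unvisited pixels
            if x + 1 < w:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < w:
                    work[y + 1][x + 1] += err * 1 / 16
    return out


# A flat mid-gray patch (value 100) standing in for a continuous tone region.
gray = [[100] * 8 for _ in range(8)]
flat = binarize(gray)          # every pixel falls below the threshold
dithered = error_diffuse(gray)  # a mix of black and white dots
```

In this sketch the binarized patch collapses to a single level, so the gray shade is lost, whereas the error-diffused patch approximates the shade with a pattern of black and white dots. The same dot pattern applied across a sharp text edge is what blurs the edge.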
In addition, facsimile machines may compress and decompress a rendered bit map representation of documents by Group 3 or Group 4 standards. Examples of Group 3 and Group 4 standards are described in: CCITT, "Recommendation T.4, Standardization of Group 3 facsimile apparatus for document transmission," Vol. VII-Fascicle VII.3, 21-47; and, CCITT, "Recommendation T.6, Facsimile coding schemes and coding control functions for Group 4 facsimile apparatus," Vol. VII-Fascicle VII.3, 48-57. However, although data compression may be performed, the poor reproduction of mixed image type documents remains.
To improve reproduction quality, digital documents can be segmented into their image components. The resulting segments can then be classified as to image type, and different rendition methods applied to segments based on their type. Many of the proposals for segmenting a document heretofore presented are oriented towards analyzing different information in a mixed document, such as for optical character recognition (OCR) purposes.
These approaches include such methods as recursive X-Y cut (RXYC), and constrained run-length algorithm (CRLA), which is also referred to as run length smoothing algorithm (RLSA). The following literature describes RXYC: G. Nagy, S. Seth, and S. D. Stoddard, "Document analysis with an expert system," Proc. Pattern Recog. in Practice, Amsterdam, Jun. 19-21, 1985, Vol. II; and, P. J. Bones, T. C. Griffin, C. M. Carey-Smith, "Segmentation of document images," SPIE Vol. 1258 Image Communications and Workstations, 78-88, 1990. CRLA is described in: F. M. Wahl, K. Y. Wong, and R. G. Casey, "Block segmentation and text extraction in mixed text/image documents," Comput. Vision Graphics Image Process., Vol. 20, 375-390, 1982; B. S. Chien, B. S. Jeng, S. W. Sun, G. H. Chang, K. H. Shyu, and C. S. Shih, "A novel block segmentation and processing for Chinese-English document," SPIE Vol. 1606 Visual Communications and Image Processing '91: Image Processing, 588-598, 1991; T. Pavlidis and J. Zhou, "Page segmentation and classification," CVGIP: Graphical Models and Image Processing, Vol. 54, No. 6, 484-496, November 1992; and, P. Chauvet, J. Lopez-Krahe, E. Taflin, and H. Maitre, "System for an intelligent office document analysis, recognition and description," Signal Processing, Vol. 32, 161-190, 1993.
RXYC and CRLA both assume an alignment of digital documents and rectangular segments. Accordingly, these methods have strong directional preferences, and require processing to correct improper document segmentation due to non-rectangular segments and skewing of segments from the assumed alignment. Moreover, tilting of image components from their assumed alignment in the document may result in segments having mixed image types. It would therefore be desirable to perform document segmentation which is not subject to the above limitations of document alignment or rectangular shaped segments.
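The directional preference of run length smoothing can be seen in a minimal sketch of a single horizontal pass over a binary bit map (1 for black, 0 for white). The function names and the gap limit are hypothetical, and only white gaps lying between two black pixels are filled; published variants typically combine horizontal and vertical passes, which is omitted here.

```python
def smooth_row(row, max_gap):
    """Fill white runs of length <= max_gap that lie between black pixels."""
    out = list(row)
    last_black = None
    for i, p in enumerate(row):
        if p == 1:
            gap = i - last_black - 1 if last_black is not None else 0
            if 0 < gap <= max_gap:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out


def smooth_horizontal(bitmap, max_gap):
    """Apply the row smoothing to every row of a bit map."""
    return [smooth_row(row, max_gap) for row in bitmap]


# Two black pixels separated by a gap of two along a row are merged...
along_row = [[1, 0, 0, 1]]
# ...but the same gap running down a column is left untouched.
down_column = [[1], [0], [0], [1]]

merged = smooth_horizontal(along_row, max_gap=2)
unchanged = smooth_horizontal(down_column, max_gap=2)
```

Because the pass only looks along rows, components that are skewed away from the row direction are merged differently from aligned ones, which is the alignment dependence noted above.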
Several other segmenting proposals have been oriented towards document rendition, such as performed in facsimile machines, rather than document analysis. Examples of these segmentation proposals are contained in the following publications: Y. Chen, F. C. Mintzer, and K. S. Pennington, "A binary representation of mixed documents (text/graphic/image) that compresses," ICASSP 86, 537-540, 1986; M. Yoshida, T. Takahashi, T. Semasa, and F. Ono, "Bi-level rendition of images containing text, screened halftone and continuous tone," Globecom '91, 104-109, 1991; and, S. Ohuchi, K. Imao, and W. Yamada, "A segmentation method for composite text/graphics (halftone and continuous tone photographs) documents," Systems and Computers in Japan, Vol. 24, No. 2, 35-44, 1993.
In Ohuchi et al., a digital document is first subdivided into non-overlapping 4×4 pixel blocks. A block is considered a halftone block if gray level peaks appear in pixels of blocks neighboring the block. A first mask is created for the document by combining the halftone blocks to detect halftone areas. A second mask is then generated by quantizing the pixels of the document into three levels, detecting continuous black and white pixels by pattern matching of a 5×5 pixel block, and activating the block as an edge area once a desired pattern is detected. The two masks determine the classification of pixels. Text areas of the document are based on edge areas of the second mask and the non-halftone areas of the first mask. All areas which are not text are considered graphics. Graphics are halftoned by dithering or error diffusion, and then the document is binarized.
In Chen et al., a digital document is first subdivided into non-overlapping 4×4 pixel blocks. Each block is classified as text or image as follows: Two sets of four pixels are selected from a block. If any of the four pixels in each set has a gray level value above a white threshold, the block is text. If two selected pixels from each set are below a black threshold, the block is also text. Blocks not classified as text are classified as image. Runs of horizontal image blocks shorter than 12 blocks are reclassified as text blocks. Pixels in text blocks are binarized into a first bit map, and pixels in image blocks are halftoned by error diffusion.
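The reclassification of short horizontal runs of image blocks can be sketched as follows. This is an illustration only: the function name, the string labels, and the representation of one row of blocks as a list are hypothetical, and the per-block classification itself is not reproduced because the selected pixels are not specified here.

```python
def reclassify_short_image_runs(labels, min_run=12):
    """Relabel runs of "image" blocks shorter than min_run as "text".

    labels is one horizontal row of block labels, each "text" or "image".
    """
    out = list(labels)
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] != "image":
            i += 1
            continue
        j = i
        while j < n and labels[j] == "image":
            j += 1  # j ends one past the last block of the run
        if j - i < min_run:
            for k in range(i, j):
                out[k] = "text"
        i = j
    return out


# One row of blocks: a run of 5 image blocks and a run of 12 image blocks.
row = ["text"] * 2 + ["image"] * 5 + ["text"] + ["image"] * 12
cleaned = reclassify_short_image_runs(row)
```

Here the run of 5 image blocks is relabeled as text, while the run of 12 is kept as image, since only runs shorter than 12 blocks are reclassified.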
Further, in Yoshida et al., a digital document is segmented by first classifying each pixel as a screened or unscreened halftone pixel. The middle pixel of a 5×3 pixel block is classified by binarizing the pixels in the block based upon a threshold value of the average of the central 3×3 pixels, counting the number of transitions in both horizontal and vertical directions, and then comparing the number of transitions in both directions to corresponding thresholds. If the number of transitions in each direction is greater than the threshold, the pixel is a screened halftone; otherwise it is a non-screened halftone. Classification errors are then removed by setting the middle pixel as a non-screened halftone if it is part of the image background, and by matching the 5×3 block to pixel patterns and setting the middle pixel accordingly if a pattern match occurs. Non-screened halftone pixels are classified as text or continuous tone by comparing attributes of the block, including maximum gray value, minimum gray value, and the difference between the maximum and minimum, against three corresponding thresholds. If any attribute exceeds its threshold, the pixel is text; otherwise the pixel is continuous tone. Screened halftone, text, and continuous tone document areas are detected using the pixel classifications. Next, the document is rendered using error diffusion with an error feedback loop, ordered dither merging, and deletion of screened frequencies, which are controlled by parameters based upon the segmentation results.
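The transition-counting test for screened halftone pixels can be sketched as follows. Several details are assumptions made for illustration: the block is taken as 3 rows by 5 columns, the central 3×3 pixels are taken as columns 1 through 3, and the function name and threshold values are hypothetical. The later error-removal and text classification steps are not shown.

```python
def is_screened_halftone(block, h_thresh, v_thresh):
    """Classify the middle pixel of a 3-row by 5-column gray scale block."""
    # Threshold is the average of the central 3x3 pixels (columns 1..3).
    central = [block[r][c] for r in range(3) for c in range(1, 4)]
    mean = sum(central) / 9
    bits = [[1 if p > mean else 0 for p in row] for row in block]
    # Count black/white changes between horizontally adjacent pixels.
    h_trans = sum(
        bits[r][c] != bits[r][c + 1] for r in range(3) for c in range(4)
    )
    # Count black/white changes between vertically adjacent pixels.
    v_trans = sum(
        bits[r][c] != bits[r + 1][c] for r in range(2) for c in range(5)
    )
    return h_trans > h_thresh and v_trans > v_thresh


# A checkerboard of dots, as a halftone screen might produce.
screen = [
    [0, 255, 0, 255, 0],
    [255, 0, 255, 0, 255],
    [0, 255, 0, 255, 0],
]
# A single vertical edge, as at the side of a text stroke.
edge = [[0, 0, 0, 255, 255] for _ in range(3)]

screen_result = is_screened_halftone(screen, h_thresh=6, v_thresh=5)
edge_result = is_screened_halftone(edge, h_thresh=6, v_thresh=5)
```

With these assumed thresholds, the checkerboard block yields many transitions in both directions and is classified as screened, while the single edge yields few horizontal and no vertical transitions and is classified as non-screened.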
The three above described segmentation proposals have several drawbacks. First, these proposals tend to generate segments with mixed image types, for example, by including pixels of a continuous tone image in a segment classified as text. This results in poor reproduction since a single rendition method will be applied to such a mixed segment, just as when a single rendition method is applied to an entire mixed document. For example, halftoning of text in an otherwise continuous tone segment will result in poor text quality in the reproduced document. Second, these proposals result in a bit map representation of the document by halftoning continuous tone images and binarizing text images. However, halftoning continuous tone images does not adequately represent the underlying gray scale transitions due to the excessive loss of information in converting pixel gray scale values to the black and white dots of a halftone image.
In addition to the above problems, the above proposals do not accurately reproduce the shades of text images in a document. Text images possess pixels occupying predominately two gray scale levels, which represent the shade of the background and of the text foreground. Generally, the above three proposals, as well as facsimile machines, assume that text is always a darker shade than its background. Yoshida et al. even assumes a particular range of gray levels possible for text. This fails to account for text images in which the text may be lighter than its background. Furthermore, different text image regions of a document may have different sets of background and foreground levels.
As the description proceeds the following definitions are used: "primitive document" and "document page" both refer to a digital document composed of an array of pixels having values representing the gray scale; and "smart document" refers to the document generated in accordance with this invention from a primitive document.
An advantage of the present invention is that it substantially obviates the drawbacks of the prior art for document compression and provides a system especially adapted for use in a document transmission and rendition system, such as a facsimile machine. This system provides high quality document reproduction by efficiently segmenting primitive documents without accounting for document alignment or rectangular segment shape. The system further accurately classifies primitive document segments for subsequent data compression based on segment classification, and produces smart documents with data compression at ratios equalling or exceeding those obtained with known prior art document compression techniques. This compression is achieved together with accurate reproduction of documents described by data representing smart documents. Additionally, the smart documents are provided in accordance with the invention in an image data format which readily enables storage of documents. The storage format also facilitates processing of segments according to their image types. Such processing can facilitate (a) OCR (optical character recognition) of text segments, (b) image editing of gray scale segments, and (c) the conversion of documents into other representations prior to document printing.