The present invention relates generally to a method for converting an existing paper document into electronic format. More particularly, the invention relates to a method for segmenting an existing document into zones (i.e., regions) of text and halftone.
As publishing and photo-reproduction techniques have evolved, more and more documents have been produced and/or reproduced in color. Recently, documents have been moving rapidly from hard copy format (i.e., paper) to electronic format. With this shift, a demand for automatically converting paper documents into electronic format has arisen. Converting an entire paper document to electronic format without distinguishing its content is inefficient with respect to both the storage requirements and the difficulty of information retrieval. An alternative approach is to separate text regions from graphic regions within the original document and store the two parts differently. The text regions can then be recognized by optical character recognition (OCR) software, which allows title extraction, information retrieval and indexing of the text content. To translate paper documents to electronic format more efficiently, document segmentation (i.e., the segmentation of a document into separate text and graphic regions) must be coupled with an OCR process, reducing the processing time by ensuring that the OCR software operates only on actual text regions. Conventional methods of document segmentation operate on greyscale or binary (i.e., black and white) images of paper documents. However, conventional methods are less than optimal when applied to documents printed in color.
Document image segmentation can be performed using either a top-down approach or a bottom-up approach. Conventional top-down methods split an image alternately in the horizontal and vertical directions using line and character spacing information. Such conventional methods include run length smoothing (which converts white pixels to black pixels if the number of continuous white pixels is less than a predetermined threshold) and the recursive x-y tree method (in which a projection onto the horizontal or vertical direction is generated, and the row and column structure is then extracted from the valleys of the projection histogram, which correspond to line spacings). These conventional top-down methods are sensitive to font size and type, character and line spacing, and document orientation.
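By way of illustration only, the two top-down techniques described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation of any prior art method: the function names smooth_row and projection_bands, the min_gap parameter, the encoding of pixels as 1 (black) and 0 (white), and the choice to fill only white runs bounded by black pixels on both sides are all assumptions made for the example. The second function performs a single split of the kind the x-y tree method applies recursively.

```python
def smooth_row(row, threshold):
    """Run length smoothing on one row (1 = black, 0 = white).

    A white run is turned black when its length is less than `threshold`.
    Assumption: only runs enclosed by black pixels on both sides are filled.
    """
    out = list(row)
    run_start = None
    for i, pixel in enumerate(row):
        if pixel == 0:
            if run_start is None:
                run_start = i  # a white run begins here
        elif run_start is not None:
            # The white run [run_start, i) has just been closed by a black pixel.
            if run_start > 0 and i - run_start < threshold:
                for j in range(run_start, i):
                    out[j] = 1
            run_start = None
    return out


def projection_bands(image, axis, min_gap=1):
    """Split an image at valleys of its projection histogram.

    axis=0 counts black pixels per row, axis=1 per column.
    Returns (start, end) pairs, end exclusive, for each band of non-empty
    rows or columns; bands are separated by at least `min_gap` empty ones.
    """
    if axis == 0:
        profile = [sum(r) for r in image]
    else:
        profile = [sum(c) for c in zip(*image)]
    bands = []
    start = None
    gap = 0
    for i, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The valley is wide enough: close the current band.
                bands.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        bands.append((start, len(profile) - gap))
    return bands


page = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]
smoothed = smooth_row([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], threshold=3)
rows = projection_bands(page, axis=0, min_gap=2)
```

In both functions the result depends directly on a fixed pixel threshold (threshold or min_gap), which illustrates why such methods are sensitive to font size and to character and line spacing.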
Conventional bottom-up methods are usually based on connected components. These methods start by connecting the parts of each individual character and then use predetermined standards and heuristics to recognize characters and merge closely spaced characters together. Methods which connect components are time consuming, and are sensitive to character size, type, language and document resolution.
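The first step of such a bottom-up method, grouping black pixels into connected components, can be sketched as follows. This is an illustrative sketch only: the function name component_boxes, the use of 8-connectivity, and the encoding of pixels as 1 (black) and 0 (white) are assumptions made for the example, and the subsequent heuristics for recognizing and merging characters are not shown.

```python
from collections import deque


def component_boxes(image):
    """Return one bounding box per 8-connected component of black pixels.

    Each box is (top, left, bottom, right) with inclusive coordinates,
    listed in the order the components are first met in a raster scan.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited black pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


glyphs = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
boxes = component_boxes(glyphs)
```

Every pixel of the image is visited at least once, and the later merging step must compare box sizes and spacings against fixed standards, which is consistent with the cost and sensitivity noted above.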
One prior art texture-based document image segmentation method is described in a paper by S. Park, I. Yun and S. Lee entitled “Color Image Segmentation Based on 3-D Clustering Approach” that appeared in the August, 1998 issue of the journal PATTERN RECOGNITION. This method assumes two types of textures and separates text and graphics using a Gabor filter and clustering.
Most conventional methods operate only on a monochrome image and assume black characters on a white background. Color document images are more complicated than such monochrome images, since they typically have a complex background of varied colors and patterns and a more complicated page layout. Further, colors that appear homogeneous to human vision actually consist of many variations in the digital color space, which further complicates the processing and analysis of color documents.
Accordingly, it is the object of the present invention to provide an improved method for segmenting a color document into text and graphic regions that overcomes the drawbacks of the prior art.