1. Field of the Invention
The present invention relates to the field of character recognition systems, in particular to a method for performing segmentation of a document image into distinguishable parts, namely text, images and lines.
2. Description of the Related Art
Optical character recognition provides for creating a text file on a computer system from a printed document page. The created text file may then be manipulated by a text editing or word processing application on the computer system. As a document page may comprise both text and pictures, or the text may be in columns, such as in a newspaper or magazine article, an important step prior to character recognition is document segmentation. Document segmentation is the identification of the various text, image (picture) and line segment portions of the document image. As only the text portions of the document image can be converted into a text file, it is desirable to limit the character recognition to only those areas of the document which have text, and to provide an order by which those portions of text are inserted into the text file. Ordering of the text portions is desirable in order to avoid the creation of text files that would not logically correspond to the original document. Such text files would be of diminished value.
Document segmentation can be performed by manual, semi-automatic or fully automatic methods. Known systems use manual or semi-automatic methods. In a manual method, a document image is scanned on a scanning means coupled to a computer system, whereby a bit-mapped representation is created and presented to a user via a display screen. The user specifies the text areas of the document on the computer screen using a cursor control device such as a mouse, or by providing keyboard input. In a semi-automatic method, the user may simply perform some classification or verification by interacting with the system. This may take the form of a dialog with an application program performing the character recognition.
In fully automatic systems, the process of segmentation is carried out without user interaction. Fully automatic document segmentation methods can be further categorized as either (1) top-down or (2) bottom-up. The top-down approach starts by making a hypothesis that a specific set of document layouts exists, and verification is made by examining the data in more and more detail. To classify segments of the document, a backtracking scheme is used to traverse a tree-type data structure representing the document. The top-down method works well for a clearly specified set of documents with fixed layouts, but is ill-suited for situations where different document types are considered. A second top-down approach is described in an article entitled "Image Segmentation by Shape-Directed Conversions", Baird, et al., Proceedings of the 10th International Conference on Pattern Recognition, Atlantic City, N.J., June 1990. The method described in the article is based on a scheme called Global to Local Layout Analysis. This method is very similar to a top-down scheme, except that statistical estimation is substituted for a backtracking strategy.
Bottom-up segmentation methods are data driven, i.e. the decision making process for segmentation is dynamic and based on information derived in a prior step. One such bottom-up method is based on a Constrained Run Length Algorithm (CRLA) described in the article "Document Analysis System", by K. Y. Wong, R. G. Casey, and F. M. Wahl, IBM Journal of Research and Development, Vol. 26, No. 6, pgs. 647-656. The CRLA method is fast and accurate for some standard documents. However, the method is not designed to accommodate documents of a non-rectangular shape or those which have skew. A second bottom-up method is described in an article entitled "Improved Algorithm for Text String Separation for Mixed Text/Graphics Images", J. R. Gattiker and R. Kasturi, Computer Engineering Technical Report, TR-88-043, Department of Electrical Engineering, Pennsylvania State University. This second method is specifically designed to segment CAD/CAM documents and does not lend itself well to operation with general documents. Also, the method utilizes a computationally intensive character recognition algorithm to classify text areas, which causes prolonged computation times.
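The run-length idea underlying bottom-up methods such as the CRLA can be illustrated with a simplified, one-dimensional sketch. It is not the algorithm of the cited article; it only shows how short runs of white pixels between black pixels on a scan line may be filled so that neighboring characters merge into blocks. The function names and the `max_gap` threshold are hypothetical, and the image is assumed to be a list of rows in which 1 denotes a black pixel and 0 a white pixel.

```python
def smooth_row(row, max_gap):
    """Fill white runs of length <= max_gap that lie between two black pixels.

    Illustrative only: 'row' is assumed to be a list of 0 (white) and
    1 (black) values, and 'max_gap' is a hypothetical threshold.
    """
    out = list(row)
    last_black = None  # index of the most recent black pixel seen
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_black is not None:
                gap = i - last_black - 1
                if 0 < gap <= max_gap:
                    # Short white gap: merge it into the surrounding black.
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out


def smooth_image(image, max_gap):
    """Apply the horizontal smoothing independently to every scan line."""
    return [smooth_row(row, max_gap) for row in image]
```

Because each scan line is processed along the image axes, such a scheme implicitly assumes that text lines run parallel to those axes, which is consistent with the limitation on skewed documents noted above.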
Other known methods of fully automatic document segmentation combine elements of the top-down and bottom-up approaches to solve different aspects of the task. Such a known method is described in a pair of articles by T. Pavlidis, "Page Segmentation by the Line Adjacency Graph and the Analysis of Run Lengths", February 1990, and "A Vectorizer and Feature Extractor for Document Recognition", Computer Vision, Graphics and Image Processing, Vol. 35, pgs. 111-127, 1986. The method described is based on a hybrid top-down, bottom-up approach, using a Line Adjacency Graph (LAG) for image segmentation. However, the LAG approach was not designed to deal efficiently with documents containing halftone areas. Further, the LAG approach requires a large amount of workspace memory.
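For illustration, a line adjacency graph may be sketched as follows, on the assumption that each node is a horizontal run of black pixels and that an edge joins two runs on adjacent scan lines whose column ranges overlap. This is a minimal sketch of the general data structure, not the implementation of the cited articles; the function names and the tuple layout are hypothetical.

```python
def black_runs(row):
    """Return (start, end) column pairs for each run of black (1) pixels."""
    runs = []
    start = None
    for x, pixel in enumerate(row):
        if pixel == 1 and start is None:
            start = x                      # a new run begins
        elif pixel != 1 and start is not None:
            runs.append((start, x - 1))    # the current run ends
            start = None
    if start is not None:
        runs.append((start, len(row) - 1))  # run reaches the row's end
    return runs


def line_adjacency_graph(image):
    """Build nodes (y, x_start, x_end) and edges between overlapping runs.

    Assumption: two runs are adjacent when they lie on consecutive scan
    lines and share at least one column.
    """
    nodes = []
    edges = []
    previous = []  # (node_index, start, end) for runs on the prior line
    for y, row in enumerate(image):
        current = []
        for start, end in black_runs(row):
            index = len(nodes)
            nodes.append((y, start, end))
            for p_index, p_start, p_end in previous:
                if start <= p_end and p_start <= end:
                    edges.append((p_index, index))
            current.append((index, start, end))
        previous = current
    return nodes, edges
```

In this sketch every black run becomes a node, so the graph grows with the number of runs in the image. A halftone area, which presumably contains a very large number of short runs, would therefore enlarge the graph considerably.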
Known methods for document segmentation trade speed and accuracy against system resource requirements. It is an objective of the present invention to provide a method and apparatus for document segmentation in which speed and efficiency are obtained without requiring a large amount of system memory or sacrificing the accuracy of the segmentation.