A problem of increasing importance in computer technology is the extraction of information from images which are represented as arrays of pixels (picture element intensity values). On the one hand, computer technology has successfully automated the acquisition, storage, and transmission of images of documents; on the other hand, it has even more successfully automated the storage and manipulation of information represented by strings of digital codes representing text characters.
What has been much less successfully automated is the conversion of character information in images into character-string data. Often, the technique used is the same one that was used to convert character information on pieces of paper into character-string data: a data entry clerk reads the image and uses a keyboard to enter character-string data equivalent to the information in the image into a database. The difficulties with this procedure are obvious: it is expensive, slow, and error-prone.
An important component of the technology of extracting information from images is image segmentation, i.e., the division of an image into portions having different properties. FIG. 1 shows how apparatus which performs image segmentation is used in a system 101 for extracting character information from images. Image 103 is an image which is represented in the memory of a data processing system as an array of pixels. It serves as an input to segmenter 107, a program executing in a processor of the data processing system. Segmenter 107 produces a segmentation of image 103 in which text columns 105 are separated from non-text 106. A text column in the present context is one or more lines of text. In the case of a multi-line column, the lines making up the column are arranged with reference to a common vertical line. Non-text 106 may be white space or it may be illustrations, ornamental borders, patterns, or the like. The output from segmenter 107 is text column images 109, which are portions of image 103 which contain only text columns 105. Text column images 109 are then used as input to text column analyzer 111, another program executing in a processor of the data processing system. Given images of text columns, text column analyzer 111 is able to interpret the images as characters, words, and lines and output digital character codes 113 corresponding to the text in the text column images. The digital character codes 113 are then generally output to a text file 115. The digital character codes produced by text column analyzer 111 may of course be manipulated by the data processing system in the same fashion as any other character codes. A state-of-the-art text column analyzer 111 is described in Henry S. Baird, "Global-to-Local Layout Analysis", in Proceedings of the IAPR Workshop on Syntactic and Structural Pattern Recognition, Pont-a-Mousson, France, 12-14 Sep. 1988, which is incorporated herein by reference.
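The data flow of system 101 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the pixel convention (0 for white, 1 for black), the function names, and in particular the splitting rule, which simply cuts the image at vertical runs of white pixel columns. That rule is a naive stand-in for segmenter 107 and is not the segmentation technique of the invention; the analyzer is a placeholder that performs no character recognition.

```python
from typing import List, Tuple

# Assumed representation: a list of pixel rows, 0 = white, 1 = black.
Image = List[List[int]]


def segment_columns(image: Image, min_gap: int = 2) -> List[Image]:
    """Naive stand-in for segmenter 107: cut at vertical white gaps.

    A gap is a run of at least `min_gap` pixel columns with no black pixel.
    """
    if not image:
        return []
    width = len(image[0])
    # A pixel column is "inked" if any row has a black pixel in it.
    inked = [any(row[x] for row in image) for x in range(width)]

    spans: List[Tuple[int, int]] = []  # half-open (start, end) x-ranges
    start = None
    gap = 0
    for x, has_ink in enumerate(inked):
        if has_ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The span ends just after the last inked pixel column.
                spans.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        spans.append((start, width - gap))

    return [[row[a:b] for row in image] for a, b in spans]


def analyze_column(column: Image) -> str:
    """Placeholder for text column analyzer 111 (no real recognition)."""
    return f"<text column, {len(column[0])} pixels wide>"


def extract_text(image: Image) -> List[str]:
    """Image 103 -> text column images 109 -> one result per column."""
    return [analyze_column(col) for col in segment_columns(image)]


page = [
    [1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 1],
]
columns = segment_columns(page)
```

The point of the sketch is the interface rather than the rule: the analyzer only ever receives sub-images that the segmenter has already judged to be single columns.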
Segmenter 107 is a necessary component of system 101 because text column analyzer 111 presumes that each image it receives represents exactly one column of text. Consequently, if text column analyzer 111 receives more than one text column 105 or non-text 106 as input, it may fail. If the input is multiple text columns 105, text column analyzer 111 may not be able to locate the lines of text, and even if it does, it will not read them in the correct order. If the input is non-text 106, text column analyzer 111 may interpret illustrations, ornamental borders, other non-textual material, or even spots of "dirt" as text. In the best case, text column analyzer 111 will fail, and will merely have wasted time and processing resources. In the worst case, text column analyzer 111 will succeed in producing output. When that happens, text column analyzer 111 may add non-existent characters to the text being extracted from the image or may even completely misinterpret the text in the image.
1. Field of the Invention
The invention relates broadly to the art of extracting information from images represented as arrays of pixels and more specifically to the art of segmenting such images in order to simplify the extraction of information from them. The techniques of the invention are particularly useful for segmenting images which contain text.
2. Description of the Prior Art
A recent survey of techniques for segmenting images which contain text, S. N. Srihari and G. W. Zack, "Document Image Analysis", Proceedings, 8th International Conference on Pattern Recognition, Paris, France, Oct. 1986, pp. 434-436, divides fully-automatic segmentation techniques into two broad categories: top-down and bottom-up. Top-down techniques begin by making high-level hypotheses about the location of text in the image (for example, that there will be double-column text with a header). They then make trees of lower-level hypotheses based on the high-level hypotheses, and continue thus downward until they reach a level where the correctness of a hypothesis may be determined by examining the document image. If the document image does not support the hypothesis, the top-down techniques back up in the tree until they reach a level which was not demonstrated false by the document image and attempt another branch of the tree. As is obvious from the foregoing, if there is a bad match between the high-level hypotheses and the actual form of the document, it will take a program using top-down techniques a great deal of time to determine the location of the text.
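The top-down search described above can be sketched as a depth-first walk over a tree of hypotheses with backtracking. This is a toy model under stated assumptions, not code from the cited survey: the `Hypothesis` class, the `search` function, and the representation of the document image as a dictionary of already-measured properties (`columns`, `headers`) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Assumed stand-in for the document image: a few measured properties.
Page = Dict[str, int]


@dataclass
class Hypothesis:
    name: str
    children: List["Hypothesis"] = field(default_factory=list)
    # Only leaves carry a check that can be tested against the page.
    check: Optional[Callable[[Page], bool]] = None


def search(node: Hypothesis, page: Page) -> Optional[List[str]]:
    """Return the first chain of hypotheses the page supports, else None."""
    if not node.children:
        ok = node.check is not None and node.check(page)
        return [node.name] if ok else None
    for child in node.children:
        found = search(child, page)
        if found is not None:
            return [node.name] + found
    # Every branch below this node was refuted: back up to the parent.
    return None


layouts = Hypothesis("page", [
    Hypothesis("double-column", [
        Hypothesis("with header",
                   check=lambda p: p["columns"] == 2 and p["headers"] == 1),
        Hypothesis("without header",
                   check=lambda p: p["columns"] == 2 and p["headers"] == 0),
    ]),
    Hypothesis("single-column", [
        Hypothesis("with header",
                   check=lambda p: p["columns"] == 1 and p["headers"] == 1),
        Hypothesis("without header",
                   check=lambda p: p["columns"] == 1 and p["headers"] == 0),
    ]),
])

result = search(layouts, {"columns": 1, "headers": 0})
```

In this example the search must refute the entire double-column subtree before it tries the branch that fits, and a page with three columns exhausts the whole tree without a match, which illustrates why a poor match between the high-level hypotheses and the document is costly.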
Bottom-up techniques begin by locating images of characters. They then form characters into words, words into lines, lines into columns, and so forth. The problem with these techniques is that they have no global view of the text in the document, and therefore often make mistakes concerning which characters make up a word, which words make up a line, which lines make up a column, and so forth. The art has attempted to deal with these problems by using rules to prevent certain classes of mistakes. What is needed, and what is provided by the apparatus and methods disclosed herein, are techniques for segmentation which combine the simplicity and speed of the bottom-up techniques with the global view of the top-down techniques.
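The first bottom-up step, forming characters into words, can be sketched as follows. The sketch assumes that character bounding boxes have already been located and that all of them lie on a single text line; the `Box` layout, the function names, and the fixed `max_gap` threshold are hypothetical choices for illustration and are not drawn from any particular prior-art system.

```python
from typing import List, Tuple

# Assumed bounding box layout: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


def merge(a: Box, b: Box) -> Box:
    """Smallest box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def group_into_words(chars: List[Box], max_gap: int) -> List[Box]:
    """Merge left-to-right neighbours whose horizontal gap is <= max_gap.

    Purely local rule: only the distance to the previous box is examined.
    """
    words: List[Box] = []
    for box in sorted(chars):  # tuples sort by x0 first
        if words and box[0] - words[-1][2] <= max_gap:
            words[-1] = merge(words[-1], box)
        else:
            words.append(box)
    return words


chars = [(0, 0, 5, 10), (6, 0, 11, 10), (20, 0, 25, 10)]
tight = group_into_words(chars, max_gap=2)
loose = group_into_words(chars, max_gap=10)
```

With `max_gap=2` the first two characters form one word and the third stands alone; with `max_gap=10` all three are merged into a single box. Nothing in the local rule indicates which threshold is right for a given document, which is the kind of mistake that arises from having no global view of the text.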