A document recognition system is a system that takes as input a digitized image of a document and outputs a text-based digital representation of the document. The output representation captures, at least, the text in the document as recognized by the system, but may also capture the graphics and layout of that document as recognized by the system. The output produced is in a format suitable as input to some target text processing application. That target application may, for example, be a word processor, text editor, or a spreadsheet program.
A hypothetical document recognition system might typically be logically and structurally decomposed into four subsystems: a segmentation subsystem, a character recognition subsystem, a format/layout analysis subsystem, and an output subsystem.
The segmentation subsystem is usually at the front end of the system and serves to segment the image into distinct text regions and graphics regions. This invention would be integrated as part of this subsystem. This invention extends the capability of the segmentation subsystem to segregate line drawing regions from text regions.
The character recognition subsystem analyzes the imaged text rendered in a particular region of the image in order to produce as output the underlying text corresponding to that imaged text. It converts the image of text to character codes for that text. The character recognition subsystem should be restricted to processing purely text regions. It should not process any of the graphical (non-text) regions in the document; it should ignore such regions. If the character recognition subsystem were to encounter a graphics region, it would assume that the region is text and may generate spurious characters. The character recognition subsystem may also waste CPU time trying to interpret graphical elements in such a region as text. By improving the ability of the segmentation subsystem to efficiently distinguish graphics regions from text regions, this invention reduces the processing load of the character recognition subsystem. It also has the potential to increase the accuracy of the output of this subsystem by eliminating the spurious characters that result from processing a graphics region.
The format/layout analysis subsystem analyzes the output of both the segmentation and character recognition subsystems in an attempt to capture the format and layout of the document in a representation internal to the system. The layout analysis subsystem may need information on the location of all graphical regions so as to properly construct a representation of the document. For example, such knowledge is critical in determining which text regions are captions (which would be treated specially because they do not form part of the main text flow of the document).
The document output subsystem converts this internal representation into an output that is suitable for a particular target document processing application. If the output representation accommodates it, it may be desirable to "cut" graphics regions out of the document image so that they can be "pasted" as images into the text document representation that is output by the document recognition system. Since this invention extends the class of regions that are accurately classified as having dominantly graphical content, the invention will enhance this aspect of the output subsystem.
A document recognition system may retain graphics regions as digital images throughout the system. These regions can be embedded as images in the output that the target application receives or can be dropped entirely from the output representation.
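The four-subsystem decomposition described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the names RegionKind, Region, crop, recognize_document, and emit_plain_text are hypothetical, the segmentation and character recognition steps are passed in as caller-supplied functions, layout analysis is reduced to a simple top-to-bottom ordering, and the bounding-box convention (top, left, bottom, right) is an assumption. The point it shows is that character recognition is invoked only on regions the segmentation step has classified as text, while graphics regions are carried through to the output untouched.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple

Image = List[List[int]]          # binary image: 0 = white, 1 = black
Box = Tuple[int, int, int, int]  # assumed convention: (top, left, bottom, right)


class RegionKind(Enum):
    TEXT = "text"
    HALFTONE = "halftone"          # rendering of a continuous-tone image
    LINE_DRAWING = "line_drawing"  # graphics that are not continuous tone


@dataclass
class Region:
    kind: RegionKind
    box: Box
    text: str = ""  # filled in by character recognition for text regions only


def crop(image: Image, box: Box) -> Image:
    """Cut the pixels inside `box` out of the page image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def recognize_document(
    image: Image,
    segment: Callable[[Image], List[Region]],
    recognize_text: Callable[[Image], str],
) -> List[Region]:
    """Segment the page, then run character recognition on text regions only."""
    regions = segment(image)
    for region in regions:
        if region.kind is RegionKind.TEXT:
            region.text = recognize_text(crop(image, region.box))
        # Graphics regions are skipped, so no spurious characters are produced.
    return regions


def emit_plain_text(regions: List[Region]) -> str:
    """Stand-in for layout analysis and output: order regions top to bottom,
    left to right, and emit a placeholder line for each graphics region."""
    ordered = sorted(regions, key=lambda r: (r.box[0], r.box[1]))
    lines = []
    for r in ordered:
        if r.kind is RegionKind.TEXT:
            lines.append(r.text)
        else:
            lines.append(f"[{r.kind.value} at {r.box}]")
    return "\n".join(lines)
```

A real output subsystem would embed the cropped graphics region as an image, or drop it, depending on what the target application accepts; the placeholder line here only marks where that region would go.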
The graphical regions of the image may be broadly classified into two types: those that are representations of continuous-tone images, such as photographs, and those that are not intrinsically continuous tone in origin, such as cartoons, drawings, maps, flowcharts, and diagrams. This disclosure refers to the graphic elements that are not continuous tone in nature as line drawings.
In this disclosure it is assumed that the image being processed is binary, meaning it has only two levels: pixels can only take on values from the set {0,1}, representing white and black respectively. In a binary image, continuous-tone regions from a grayscale image may be rendered as halftones, where the ratio of white to black pixels in a small area reflects the gray level in the grayscale rendering over that area. This technique is called dithering. Line drawings and text are intrinsically not continuous in tone; such regions are ideally rendered from a grayscale image by thresholding.
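The difference between the two renderings can be seen in a small sketch. The disclosure does not say which dithering method is used, so the example below assumes ordered dithering with a 2x2 Bayer matrix, one common choice. It also assumes 8-bit grayscale input in which 0 is black and 255 is white, and a fixed cutoff of 128 for thresholding. The function names are hypothetical. Output pixels follow the convention stated above: 0 is white and 1 is black.

```python
from typing import List

Gray = List[List[int]]    # assumed input: 0 (black) .. 255 (white)
Binary = List[List[int]]  # output: 0 = white, 1 = black

# 2x2 Bayer index matrix; an assumed choice of dither pattern.
BAYER_2X2 = [[0, 2],
             [3, 1]]


def threshold(gray: Gray, cutoff: int = 128) -> Binary:
    """Global thresholding: one cutoff for every pixel. Suits text and
    line drawings, which are essentially two-level to begin with."""
    return [[1 if g < cutoff else 0 for g in row] for row in gray]


def ordered_dither(gray: Gray) -> Binary:
    """Ordered dithering: the cutoff varies with pixel position, so the
    fraction of black pixels in a small area tracks the local gray level."""
    out = []
    for y, row in enumerate(gray):
        out_row = []
        for x, g in enumerate(row):
            # Position-dependent cutoff spread evenly over 0..255.
            cutoff = (BAYER_2X2[y % 2][x % 2] + 0.5) * 255 / 4
            out_row.append(1 if g < cutoff else 0)
        out.append(out_row)
    return out


def black_fraction(binary: Binary) -> float:
    """Fraction of pixels that are black (value 1)."""
    total = sum(len(row) for row in binary)
    return sum(sum(row) for row in binary) / total


if __name__ == "__main__":
    mid_gray = [[128] * 4 for _ in range(4)]
    print(black_fraction(threshold(mid_gray)))       # 0.0
    print(black_fraction(ordered_dither(mid_gray)))  # 0.5
```

On a uniform mid-gray patch, thresholding collapses the whole area to a single value, while dithering keeps half of the pixels black, which is why halftone regions have a texture that is quite different from text and line drawings.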