1. Field of the Invention
This invention relates to a method and apparatus for detecting line segments and particular patterns which may be present in a document and, more particularly, to such a method and apparatus having particular utility in optical character recognition wherein a document is formed of line and character indicia and wherein line indicia are discriminated from character indicia to facilitate character interpretation.
2. Description of the Prior Art
Optical character recognition has long been used for machine reading of information on a printed document to permit such information to be converted into electronic form. Typically, a document is provided with alphanumeric character indicia, punctuation indicia and line indicia, the latter usually being the underscoring of a word or passage. To facilitate character recognition, it generally is helpful to eliminate printed lines from the document, such as the aforementioned underscoring. Of course, once a document is printed, such lines generally cannot be ignored.
Various techniques have been proposed for recognizing alphanumeric characters as well as other characters normally used to convey information by way of a printed document. Some of these techniques are described in Japanese Patent Publications Nos. 62-74181, 62-74182, 62-74183 and 62-74184. Other examples of techniques which are used to recognize printed characters are described in, for example, "Segmentation Methods for Recognition of Machine-Printed Characters", Hoffman and McCullough, IBM Journal of Research and Development, March 1971, pages 153-165; "Block Segmentation and Text Extraction in Mixed Text/Image Documents", Wahl, Wong and Casey, Computer Graphics and Image Processing, Vol. 20 (1982), pages 375-390; and "Approach to Smart Document Reader System", Masuda, Hagita, Akiyama, Takahashi and Naito, Proceedings ICTP 1985, pages 550-557. The existence of lines on the printed document, when converted into scanned image data, may detract from successful recognition of the printed characters. For example, a simple underscore may be blurred, broken or printed with non-uniform thickness. A line of this type may be erroneously interpreted as an appendant element, thus defeating the successful recognition of a nearby character.
The problem of discriminating between lines and characters is exacerbated when the line is slightly angled or tilted on the document. Moreover, lines often are printed or drawn in particular patterns, some of which patterns contain characters to be recognized and others contain graphic information, such as drawings, photographs and the like, which need not be machine-interpreted. For example, lines may be printed in a grid so as to constitute a table within which alphanumeric characters are disposed. In another example, the lines may be formed as a rectangular box, or block, within which may be provided alphanumeric characters or the aforementioned graphic data. In the latter case, it is desirable to identify the block which serves as a border to the graphic data and to program the character recognition system simply to disregard all indicia within that block. For documents containing a table formed of lines, such as horizontal and vertical lines, it is desirable simply to disregard the table framework formed of those lines, thus leaving only the alphanumeric characters therein to be recognized. As described herein, such line, block and table elimination is carried out electronically on the picture data that is derived from scanning the document.
Typically, picture data is in the form of pixel information which then is translated into a bit-mapped representation of the scanned document. For example, an optical scanner may exhibit a resolution on the order of 300 dots per inch such that a typical line segment may occupy an area of the bit mapping corresponding to a width of about four dots and a length determined by the actual length of the line (for example, a line that is one inch long will occupy an area on the order of about 4 × 300 dots). Line detection techniques that have been proposed previously include the so-called five line method, the contour vector pair center detection method, the direction-wise black run shortest center detection method, the arc tracing method, the peripheral distribution detection method and the line density detection method. Many of these techniques require that the coordinates of the starting and ending points of the line be determined; and such coordinates are easily ascertained from the bit-mapped representation. However, non-uniform thickness of the line, resulting in different dot densities, makes it difficult, if not impossible, for the aforementioned methods to detect the line accurately. For example, in some locations the line may exhibit a width of only one dot, in other locations the line may have a width of two dots, in still other locations the line may have a width of three dots, and so on. Hence, the line cannot be electronically erased correctly which, in turn, results in significant errors in character recognition.
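By way of illustration only, the bit-mapped representation and the non-uniform line thickness discussed above may be sketched as follows. This is a minimal sketch and not any of the previously proposed detection methods: the function names `nominal_line_area` and `column_thickness`, the row-major list-of-lists bitmap layout (1 for a black dot, 0 for a white dot) and the sample data are all assumptions introduced here for explanation.

```python
def nominal_line_area(length_inches, dpi=300, width_dots=4):
    """Approximate number of dots occupied by an ideal horizontal line.

    Assumes (hypothetically) a uniform width of `width_dots` and a
    scanner resolution of `dpi` dots per inch.
    """
    return width_dots * round(length_inches * dpi)


def column_thickness(bitmap):
    """Count the black dots (1s) in each column of a row-major bitmap.

    For a band of the bit map containing one roughly horizontal line,
    the count in each column approximates the line width at that
    position along the line.
    """
    if not bitmap:
        return []
    width = len(bitmap[0])
    return [sum(row[x] for row in bitmap) for x in range(width)]


# Hypothetical 4-row band cut from a scanned underscore whose
# thickness varies from one dot to four dots along its length.
band = [
    [0, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]

area = nominal_line_area(1.0)        # ideal one-inch line at 300 dpi
thickness = column_thickness(band)   # per-column width of the sample line
print(area, thickness)
```

In this sample the per-column width ranges from one dot to four dots, which is the kind of variation that makes a detector tuned to a single fixed line width unreliable.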
Previous proposals for detecting patterns, such as the grid of a table or the outline of a block, generally have been applied to documents wherein graphic information, such as drawings or photographs, is contained within the table or block. Such proposals, which include the so-called peripheral distribution method and the so-called enlarged contraction method, have met with little success when applied to those documents in which alphanumeric characters are enclosed within the table or block.