1. Field of the Invention
The present invention relates to a system for segmenting characters and character lines from a quantized image of a scanned document effectively and at a high speed in hardware structures having general purpose processors and memories.
2. Prior Art and Problems
In OCR systems for printed characters, it is often necessary to read a large number of machine-printed characters. However, unlike hand-written slips, wherein the characters are entered in predefined frames, printed documents such as printed slips may not be of such form that the characters fall regularly within character frames printed in a particular dropout color. In printed slips, the characters are usually printed according to the character pitches established uniquely in the printer by which the slip is printed. Further, slips to be read by OCR systems include not only originally printed high-quality slips, but also copied slips. Since copied documents inevitably include noise components, it is desirable, when scanning their images, to detect only the effective character portions, which have not been affected by noise.
In reading characters with OCR systems, besides the aforementioned problems there also exists the problem of document skewing. For example, in a document-feed-type scanner, skewing may be caused when a document is fed, and in a flat-bed-type scanner, a document may be skewed when placed on a reading platen. Further, in the case of a copied document, the document may have been copied with skewing.
Generally, conventional OCR systems employ a method of segmenting wherein first a line of character areas is segmented and then each character area is segmented from the established line of character areas by projection or the like. However, if the document is skewed and the character line is not parallel to the projecting direction, the first segmentation of the character line is difficult. This problem could be resolved by a technique of dividing a character line into several blocks and projecting each of the blocks, such as described in Japanese Published Unexamined Patent Applications Nos. 58-106,665; 58-123,169; and 58-146,973, and in an article by J. Kim, "Baseline Drift Correction of Handwritten Text", IBM Technical Disclosure Bulletin, Vol. 25, No. 10, March 1983, pp. 5111-5114. However, this prior art literature does not describe concretely how small black portions, i.e., character components, are detected. Generally, any method of determining that a character component has been detected when a pattern includes only one black dot would probably be inadequate, since it would be too sensitive to noise. On the other hand, another prior method often used in image processing employs a mask of 3×3 dots or so and determines that a character component has been detected when the number of black dots existing within the mask is more than a predetermined value. This method would require special circuits to implement it in this application, or its processing speed would be reduced if the equivalent functions were implemented in software, since bit manipulations would be required.
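For illustration only, the mask-based prior method mentioned above may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from any of the cited references: the quantized image is assumed to be a list of rows of 0 (white) and 1 (black) dots, and the function name mask_hits and the parameter threshold are hypothetical.

```python
def mask_hits(image, threshold):
    """Return centres (row, col) of 3x3 masks holding more than `threshold` black dots.

    `image` is assumed to be a list of equal-length rows of 0/1 values.
    Border dots are skipped because a full 3x3 mask does not fit there.
    """
    rows, cols = len(image), len(image[0])
    hits = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Count the black dots inside the 3x3 mask centred at (r, c).
            count = sum(image[r + dr][c + dc]
                        for dr in (-1, 0, 1)
                        for dc in (-1, 0, 1))
            if count > threshold:
                hits.append((r, c))
    return hits


# A 2x2 black blob is detected; an isolated noise dot is not.
blob = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
print(mask_hits(blob, threshold=2))
```

The nested loops visit every dot individually, which suggests why a software implementation of this method would be slow on a general purpose processor without dedicated circuits.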
In segmenting each character from an established character line, the problem of document skewing is not so difficult to overcome. However, for example, in the case of a laterally printed document, wherein the characters are more narrowly spaced from each other than the character lines, a forced segmentation should be made to avoid any connection between two adjacent characters due to noise existing therebetween. U.S. Pat. No. 3,629,826 to A. Cutaia et al. discloses a method for separating such adjacent characters connected with or touching each other. According to this method, parameters representing leading stroke edges and lagging stroke edges are detected from quantized video information of the characters, the determined parameters are weighted, and then gating signals for separating adjacent characters are generated based on the differences between the weighted parameters. This method requires rather complicated hardware and software. Therefore, it is desirable to find a simpler method for segmenting characters.
Another common practice in segmenting characters and character lines has been to determine the spaces between character lines and the spaces between characters by preparing histograms of black dots and comparing them with predetermined threshold values. However, to prepare the histograms, it is necessary to count the black dots over the entire quantized image. This would generally impose a large overhead on a microprocessor. Hence, unless a dedicated circuit is provided therefor, the processing speed for performing all the segmentations would be reduced, and even if a dedicated circuit were provided, it would further add to the cost. Accordingly, it is the object of the present invention to provide a method for segmenting character components with simpler procedures and further without adding any special dedicated circuit.
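For illustration only, the histogram practice described above as prior art may be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the present invention: the quantized image is assumed to be a list of rows of 0 (white) and 1 (black) dots, and the names row_histogram, line_spans and threshold are hypothetical.

```python
def row_histogram(image):
    """Count the black dots in every row of the quantized image."""
    return [sum(row) for row in image]


def line_spans(image, threshold):
    """Return inclusive (first_row, last_row) ranges whose count exceeds `threshold`.

    Rows at or below the threshold are treated as space between character lines.
    """
    spans = []
    start = None
    for y, count in enumerate(row_histogram(image)):
        if count > threshold:
            if start is None:
                start = y          # a character line begins here
        elif start is not None:
            spans.append((start, y - 1))  # the line ended on the previous row
            start = None
    if start is not None:
        spans.append((start, len(image) - 1))
    return spans


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],  # a sparse row, treated as noise when threshold is 1
    [1, 1, 1, 0],
]
print(line_spans(page, threshold=1))
```

Every dot of the image is read once to build the histogram, which is the overhead the paragraph above attributes to this practice. The same routine applied to columns within one character line would give the spaces between characters.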