1. Field of the Invention
The present invention relates to Optical Character Recognition (OCR) systems and, more particularly, to the character segmentation devices associated with said systems.
2. Prior Art
The use of OCR devices as a means for disseminating recorded information is well known in the prior art. The OCR devices include an optical scanner which scans an original document and generates a video stream of data. The data is representative of the information recorded on the document. The individual characters associated with the video data stream are recognized and coded to minimize the amount of data which has to be transferred. The coded data is then transmitted to remote locations over telephone lines, satellites, microwave links, etc.
The segmentation process is one of the necessary steps associated with the recognition routine. The segmentation process attempts to find the boundaries for characters associated with a line of scanned data. One method used in the prior art to isolate characters is the so-called "White Space Segmentation" method. The "White Space Segmentation" method isolates character(s) based on the white space between the printed characters of a print line. One example of the "White Space Segmentation" method is described in an article entitled "Iterative Segmentation" by R. J. Baumgartner, published in the IBM TECHNICAL DISCLOSURE BULLETIN at Vol. 14, No. 9, February 1972 (pages 2643-2644). A scanner raster scans characters and passes video data to a video block processor. The processor forms nontouching character blocks. Each character block includes one or more characters. Block information, including block height, position and length, is calculated and used to determine pitch. The problem associated with this type of segmentation is that touching characters are not segmented.
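The white space approach, and its failure on touching characters, can be illustrated with a minimal sketch. The sketch is not the method of the cited article; it assumes a binarized print line held as a list of rows in which 1 denotes a black picture element and 0 denotes white, and the function name white_space_segments is hypothetical.

```python
def white_space_segments(rows):
    """Split a binarized print line into blocks at all-white columns.

    rows: list of equal-length lists, 1 = black pel, 0 = white pel
    (an assumed representation, not one taken from the cited art).
    Returns (start, end) column ranges, with end exclusive.
    """
    if not rows:
        return []
    width = len(rows[0])
    # A column carries ink when at least one row is black at that column.
    has_ink = [any(row[x] for row in rows) for x in range(width)]

    segments = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a block begins
        elif not ink and start is not None:
            segments.append((start, x))    # white column closes the block
            start = None
    if start is not None:
        segments.append((start, width))    # block runs to the right edge
    return segments


# Two characters separated by one white column are isolated.
separated = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
]
# The same characters joined along the top row form a single block.
touching = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
]
print(white_space_segments(separated))
print(white_space_segments(touching))
```

In the second case no all-white column exists between the characters, so the two characters are returned as one block, which is the limitation noted above.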
U.S. Pat. No. 4,083,034 discloses another prior art apparatus and method for segmenting characters. The system is able to segment touching and overlapping characters. A linear sensor array scans an informational field and generates a binary data stream therefrom. The binary data stream is circulated through a shift register memory to a scan assembly memory mosaic. A stationary memory window of the shift register memory is provided to plural trackers at the shift rate. The trackers are activated on a priority basis as center cells of the memory window satisfy a start condition and continue to trace between center cells satisfying an adjacency condition. Uppermost and lowermost center cell coordinates, center cell counts, and scan counts for each tracker are provided throughout a tracing operation.
A read-only memory control unit evaluates each tracker and marks valid those trackers tracing character information. Valid tracker information is then merged and evaluated to detect and locate valid characters in the binary information stream. The valid characters are centered in the memory mosaic for output to succeeding systems.
Although the latter-mentioned prior art is an improvement over the "White Space Segmentation" method, it does not address the problem of segmenting documents having underscore characters and/or variable spacing between words. Variable interword spacing and underscoring are characteristics of many of the documents to be processed. Therefore, a complete and comprehensive segmentation device should be able to handle documents having variable interword spacing and underscoring.
Another drawback of the latter-mentioned prior art is that it requires a special formatting unit to format the data output from the linear scanner prior to segmentation.