1. Field of Invention
This invention is related to compression of scanned text images. In particular, this invention relates to determining the stroke width of a symbol for use in symbol identification.
2. Background
A major stumbling block to the common use of digitized images is their size. An 8.5" × 11" image at a resolution of 300 dots per inch (dpi) contains roughly 8 million pixels. Although binarization of a scanned image reduces the number of bits per pixel to one, this still requires one megabyte of memory to store. Compression techniques are typically characterized as lossless or lossy. In lossless compression, no data is lost in the compression and subsequent decompression. In lossy compression, a certain amount of data is lost. However, this loss is considered acceptable since the essence of the compressed data is retained after decompression.
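The size figures above can be checked with a short calculation. The following Python sketch is illustrative only; the function name binary_image_size is hypothetical and not part of the disclosure.

```python
def binary_image_size(width_in, height_in, dpi):
    """Return (pixel_count, size_in_bytes) for a binarized scan.

    Illustrative helper only: assumes one bit per pixel and no padding
    of scan lines, which real image formats may add.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    pixels = width_px * height_px
    # One bit per pixel after binarization; round up to whole bytes.
    size_bytes = (pixels + 7) // 8
    return pixels, size_bytes


pixels, size_bytes = binary_image_size(8.5, 11, 300)
print(pixels)      # 8415000 pixels, i.e. roughly 8 million
print(size_bytes)  # 1051875 bytes, i.e. roughly one megabyte
```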
Common lossless compression techniques for binary images, like CCITT Group 3 or Group 4, can compress a binary image by factors of 10 to 20. The resulting file size is unacceptably large when compared to the synthetic electronic form used to create a comparable image.
Most documents contain text. One way to compress the text of a binary image is to perform optical character recognition to create a text stream, compress the text stream using one text compression scheme and store the results. Unfortunately, the mistakes made in selection of character, font, face and position during optical character recognition are often objectionable.
Commonly assigned U.S. Pat. Nos. 5,778,095 and 5,818,965, each incorporated herein by reference in their entirety, disclose methods and apparatus for classifying symbols extracted from a scanned document into equivalence classes. An equivalence class is a set of symbols found in an image, where each symbol in the class can be substituted for another symbol in the class without changing the appearance of the image in an objectionable way. An equivalence class is represented by an exemplar. An exemplar of an equivalence class is a symbol that will be substituted for every member of the equivalence class when the image is decompressed or otherwise recreated.
The systems and methods disclosed in these patents perform run-length symbol extraction and classify the symbols into equivalence classes based on both horizontal and vertical run length information. Feature-based classification criteria for matching an extracted symbol to an exemplar are defined by a corresponding exemplar template. The exemplar template includes a plurality of horizontal and vertical template groups, each defining criteria for matching one or more symbol runs. The feature-based classification criteria use quantities that can be readily computed from the run endpoints.
Specifically, the process of matching template groups with symbol runs is identical for both horizontal/vertical template groups and horizontal/vertical symbol runs. Accordingly, the hereinafter described steps apply to both horizontal and vertical runs. The first step is to determine the distance between the symbol's horizontal/vertical axis and its bottom row/left column (called value X) and the corresponding distance in the exemplar (called value Y). These measurements are used to determine whether or not alignment rows/columns are added to the symbol or exemplar. If X-Y>1 is true, then an alignment row/column is added to the symbol. In other words, with respect to the horizontal runs, if the symbol's axis is at least 1 row closer to its bottom row than the exemplar axis is to its bottom row, start with the symbol's alignment row to align the symbol with the exemplar. Conversely, if Y-X>1 is true, then an alignment row/column is added to the exemplar. In other words, if the exemplar's axis is at least 1 row closer to its bottom row than the symbol's axis is to its bottom row, start with the alignment row of the exemplar. Note that this corresponds to a duplicated set of the template groups used for matching the bottom row of the symbol to the exemplar.
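The alignment decision can be sketched as follows. This Python fragment is a minimal illustration that follows the inequalities on X and Y exactly as stated above; the function name alignment_target and its return values are hypothetical and are not taken from the referenced patents.

```python
def alignment_target(x, y):
    """Decide where an alignment row/column is added.

    x: distance from the symbol's axis to its bottom row (or left column).
    y: the corresponding distance in the exemplar.
    Returns "symbol", "exemplar", or None when no alignment row/column
    is added. Names and return values are illustrative assumptions.
    """
    if x - y > 1:
        return "symbol"
    if y - x > 1:
        return "exemplar"
    return None


print(alignment_target(5, 3))  # symbol
print(alignment_target(3, 5))  # exemplar
print(alignment_target(4, 3))  # None
```

The same function serves horizontal and vertical runs, since the text states that the steps are identical for both.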
The processing proceeds with templates and symbol runs starting at the bottom row, right column of each exemplar. A template group is obtained. It is then determined if the current run(s) in the symbol run list match the template group criteria. As described above, more than one run can be required to match the criteria for a template group, and the match criteria are specified by the group type. If the template criteria are not met by a run in the run list, the processing continues to get a next exemplar. If the criteria are met, the runs matching the template group are consumed and the next run in the run list becomes the current run. This ensures that all runs match some template of the exemplar. If the symbol is small, round and dense, alignment and noise checks are performed for the row. This check is the accumulation of differences in adjacent row offsets when the current row has a non-zero adjacent row offset.
It is then determined if more template groups need to be checked. If not, it is determined if the symbol run list has been exhausted. If not, no match has occurred and the processing continues to get a next exemplar. For small, round and dense symbols, if the run list has been exhausted, it is determined if the accumulated offset difference is greater than a predetermined threshold. In the currently preferred embodiment, the predetermined threshold is 3/4 of the height (3/4 of the width when comparing vertical runs). If it is not a small, round and dense symbol or the predetermined threshold is not exceeded, a match has occurred and processing continues to code the symbol. If it is exceeded, the processing continues to get a next exemplar.
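The matching loop described above can be sketched, under simplifying assumptions, as follows. In this Python illustration a template group is modeled as a function that returns the number of runs it consumes (0 for no match), runs are (start, end) pairs, and the accumulated offset difference is supplied by the caller rather than computed row by row. All names (length_between, matches_exemplar) and the single length-range group type are hypothetical; the referenced patents define their own group types.

```python
def length_between(lo, hi):
    """Hypothetical group type: matches one run whose length is in [lo, hi]."""
    def group(runs, pos):
        if pos < len(runs):
            start, end = runs[pos]
            if lo <= end - start + 1 <= hi:
                return 1  # one run consumed
        return 0          # criteria not met
    return group


def matches_exemplar(template_groups, runs, small_round_dense=False,
                     accumulated_offset_diff=0, extent=0):
    """Return True if every run is consumed by the exemplar's template groups.

    extent is the symbol height (or width when comparing vertical runs).
    """
    pos = 0
    for group in template_groups:
        consumed = group(runs, pos)
        if consumed == 0:
            return False      # try the next exemplar
        pos += consumed       # the next run becomes the current run
    if pos != len(runs):
        return False          # run list not exhausted: no match
    # Extra alignment/noise check for small, round and dense symbols,
    # using the 3/4 threshold of the preferred embodiment.
    if small_round_dense and accumulated_offset_diff > 0.75 * extent:
        return False
    return True


groups = [length_between(3, 5), length_between(1, 2)]
print(matches_exemplar(groups, [(0, 3), (2, 3)]))          # True
print(matches_exemplar(groups, [(0, 3), (2, 3), (4, 4)]))  # False
```

A caller would try each exemplar in turn and code the symbol against the first exemplar for which the function returns True.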
The feature-based classification criteria also include size, number of black pixels (mass), number of black pixels that are not on the edge (interior), a measure of the slant of the symbol, and measures of roundness and squareness (volume). Using these measures in addition to shape-based measures produces acceptable results.
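Two of these features, mass and interior, can be illustrated with a short Python sketch. The text defines interior only as black pixels that are not on the edge; the sketch assumes, for illustration, that a black pixel is interior when all four of its horizontal and vertical neighbors are also black. The function name mass_and_interior and the list-of-lists bitmap representation are hypothetical.

```python
def mass_and_interior(bitmap):
    """Return (mass, interior) for a symbol bitmap of 0/1 values.

    mass: number of black (1) pixels.
    interior: black pixels whose four neighbors are all black
    (an assumed definition of "not on the edge").
    """
    rows = len(bitmap)
    cols = len(bitmap[0]) if rows else 0

    def black(r, c):
        # Pixels outside the bitmap are treated as white.
        return 0 <= r < rows and 0 <= c < cols and bitmap[r][c] == 1

    mass = 0
    interior = 0
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != 1:
                continue
            mass += 1
            if (black(r - 1, c) and black(r + 1, c)
                    and black(r, c - 1) and black(r, c + 1)):
                interior += 1
    return mass, interior


print(mass_and_interior([[1, 1, 1],
                         [1, 1, 1],
                         [1, 1, 1]]))  # (9, 1)
```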