1. Field of the Invention
The present invention relates to data and information processing, and more particularly to a handprinted symbol recognition system for translating free-form, unconstrained handwriting into computer compatible symbols (codes).
2. Description of the Prior Art
Recognition of patterns in the context of wide variability within classes over an extensive range of possible classes, completely unknown or random events, high levels of distortion and preprocessing artifacts, and high accuracy requirements, i.e., no mistakes, false alarms, etc., is a major research and development topic in the area of information extraction and analysis. Large amounts of data are often recorded in free-form unconstrained handwriting which subsequently must be translated to computer compatible symbology. A significant example of this is provided by the problem of computer identification of unconstrained, free-form handwritten symbols that occur on bathymetric, hydrographic and cartographic manuscripts.
Presently the reduction of hydrographic survey data to usable chart form extensively involves generation, editing, review, translation and merging of vast quantities of handwritten information. Considerable efforts have been expended to develop the raster scanner technology needed to generate digital image data. As a result, computer automated scanning and "digitizing" procedures could greatly improve the efficiency and throughput of these largely manual operations. The key element in such an automated system is a sufficiently reliable technique for recognizing the wide variety of unconstrained handprinting used in the preparation of bathymetric charts. The quality of the handprinting can affect both the efficiency and the accuracy; therefore, the accuracy should depend solely upon the recognition system, i.e., it should be decoupled from the efficiency.
Such automated raster scan procedures create vast quantities of data (approximately 10^8 bits per 40" × 60" document). In the image domain of line drawings, various data compression techniques are available; e.g., raster-to-lineal conversion, run-length coding, etc. However, for symbolic information such as depth soundings, names and oceanographic signs, the end product is a binary or computer code; e.g., the ASCII code for the letter or digit. When data is in this compact form, it can be easily stored in information data banks and can be managed (retrieved, updated, sorted, cross-referenced, etc.) under electronic/computer control.
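To make the data volume and the run-length coding mentioned above concrete, the following is a minimal sketch in Python. It is illustrative only and is not taken from the invention: the function names are hypothetical, and the sampling-density estimate assumes one bit per sample, which the text does not state.

```python
import math
from itertools import groupby


def samples_per_inch(total_bits, width_in, height_in, bits_per_sample=1):
    """Estimate linear sampling density from a document's total bit count.

    bits_per_sample=1 is an assumption (a bilevel image); the source only
    gives the total of roughly 10^8 bits for a 40 x 60 inch document.
    """
    samples = total_bits / bits_per_sample
    return math.sqrt(samples / (width_in * height_in))


def run_length_encode(row):
    """Encode one binary raster row as (pixel_value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(row)]


def run_length_decode(runs):
    """Rebuild the raster row from its (pixel_value, run_length) pairs."""
    row = []
    for value, length in runs:
        row.extend([value] * length)
    return row


if __name__ == "__main__":
    print(round(samples_per_inch(1e8, 40, 60)))  # about 204 samples per inch
    row = [0] * 12 + [1] * 3 + [0] * 9 + [1] * 2 + [0] * 6
    print(run_length_encode(row))
```

Under the one-bit-per-sample assumption, the quoted figure corresponds to roughly 200 samples per inch. Run-length coding pays off on line drawings because most of each row is blank background.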
In the area of handprinted symbol recognition it is common for techniques to achieve 95% to 98% accuracies where the data input variability is constrained. (See: C. Y. Suen et al, "Automated Recognition of Handprinted Characters--The State of the Art," Proceedings of the IEEE, Vol. 68, No. 4, April 1980.) However, results for such ideal optical character recognition data are not applicable where the requirement is for near-zero substitution errors (100% accuracy) in the presence of "random events" in realistic data products, with efficiencies in the neighborhood of 95%. Also, typical accuracy levels for machine print (very constrained/regular symbols) are from one to five errors per ten thousand, and the pages must be clean and well-formatted, using only certain kinds of ribbons and papers. Efficiencies are only somewhat lower, but the documents cannot contain "trash" or "unknown" symbols.
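The distinction between accuracy and efficiency can be illustrated with a short sketch. The definitions used here are an interpretation, not a statement from the text: efficiency is taken as the fraction of input symbols for which the recognizer commits to an answer, and accuracy as the fraction of those committed answers that are correct (i.e., free of substitution errors). The function name and data layout are hypothetical.

```python
def score(results):
    """Return (accuracy, efficiency) for a list of recognition results.

    Each result is a (true_label, predicted_label) pair; a prediction of
    None means the recognizer rejected the symbol instead of guessing.
    """
    total = len(results)
    accepted = [(truth, guess) for truth, guess in results if guess is not None]
    correct = sum(1 for truth, guess in accepted if truth == guess)

    # Efficiency: share of all symbols that received an answer.
    efficiency = len(accepted) / total if total else 0.0
    # Accuracy: share of answered symbols that are right. With nothing
    # accepted there are no substitution errors, so report 1.0.
    accuracy = correct / len(accepted) if accepted else 1.0
    return accuracy, efficiency


if __name__ == "__main__":
    results = [("3", "3"), ("7", "7"), ("1", None), ("4", "4")]
    print(score(results))
```

Under these assumed definitions, a rejected symbol lowers efficiency but leaves accuracy untouched, whereas a wrong answer lowers accuracy.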
The assignment of a computer compatible code to the various symbols occurring in the cartographic manuscript input can be divided into the following scanning/recognition processing steps:
1. The handwritten document must be scanned with a suitable optical/digital system to produce a sampled image in raster format.
2. The image must be thresholded and noise must be removed.
3. Individual symbols must be located, isolated and tagged.
4. The isolated subrasters must be identified, i.e., recognized or classified.
5. The digits or "characters" making up a sounding or word ("name") must be reassembled and associated with the correct (geographic) location on the document.
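Steps 2 and 3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the invention: it assumes a grayscale raster in which dark pixels are ink, uses a fixed global threshold, treats very small connected components as noise, and isolates symbols by 8-connected component labeling. All names and parameter values are hypothetical.

```python
from collections import deque


def threshold(image, level):
    """Binarize a grayscale raster: pixels darker than `level` become ink (1)."""
    return [[1 if pixel < level else 0 for pixel in row] for row in image]


def isolate_symbols(binary, min_pixels=2):
    """Return bounding boxes (top, left, bottom, right) of 8-connected blobs.

    Blobs with fewer than `min_pixels` ink pixels are discarded as noise.
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over one connected component.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right, count = r, c, r, c, 0
            while queue:
                y, x = queue.popleft()
                count += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if count >= min_pixels:
                boxes.append((top, left, bottom, right))
    return boxes


if __name__ == "__main__":
    W = 255  # white background
    image = [
        [W, W, W, W, W, W, W, W],
        [W, 10, 10, W, W, W, 20, W],
        [W, 10, W, W, W, W, 20, W],
        [W, W, W, W, W, W, 20, W],
        [W, W, W, W, 0, W, W, W],  # lone dark pixel, treated as noise
    ]
    print(isolate_symbols(threshold(image, 128)))
```

Each bounding box identifies one isolated subraster that would then be passed to the recognition step (step 4); retaining the box coordinates is what allows step 5 to associate the result with its location on the document.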
The recognition portion of the overall "digitization" task should operate on individual, isolated symbols without knowledge of the specific symbols in the immediate region of the manuscript, i.e., they must be identified "out of context". The recognition of isolated symbols has two closely interacting elements: preprocessing of the subraster image and symbol identification. The preprocessing generally is concerned with raster smoothing and filling, orientation correction and "thinning". These tasks remove irrelevant character variation, simplify the character structure, reduce noise and perform an initial data compression. The basic result of the preprocessing stage is a "stick-figure image" in which the interclass variations have been accentuated to the extent possible while reducing intraclass variations.
What is desired is a recognition system that extracts features from these "thinned figures" and executes a decision mechanism based on the values of these features to produce a virtually 100% accurate result. To accomplish this result the accuracy should be decoupled from the efficiency.
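One generic way to decouple accuracy from efficiency is a decision rule with a reject option: the recognizer answers only when the evidence is unambiguous and otherwise declines. The sketch below illustrates that idea with a nearest-prototype rule; it is an assumption made for illustration and is not the decision mechanism of the invention, which this section does not describe. The feature vectors, prototypes and thresholds are all hypothetical.

```python
import math


def classify_with_reject(features, prototypes, max_distance, min_margin):
    """Return the label of the nearest prototype, or None to reject.

    A symbol is rejected when it is too far from every prototype (a
    likely unknown symbol) or when the two nearest prototypes are nearly
    equidistant (an ambiguous symbol).
    """
    ranked = sorted(
        (math.dist(features, proto), label) for label, proto in prototypes.items()
    )
    best_distance, best_label = ranked[0]
    if best_distance > max_distance:
        return None  # nothing resembles this symbol closely enough
    if len(ranked) > 1 and ranked[1][0] - best_distance < min_margin:
        return None  # runner-up is too close to call
    return best_label


if __name__ == "__main__":
    # Hypothetical two-dimensional feature vectors for two symbol classes.
    prototypes = {"1": (1.0, 0.0), "7": (1.0, 1.0)}
    for sample in [(1.0, 0.1), (1.0, 0.5), (5.0, 5.0)]:
        print(sample, classify_with_reject(sample, prototypes, 1.0, 0.3))
```

Tightening `max_distance` or widening `min_margin` rejects more symbols, which lowers efficiency, but every rejected symbol is one that can no longer become a substitution error.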