Optical character recognition (OCR) refers to the process by which a document is scanned into a computer and analyzed to determine what characters appear in the document. This process eliminates the need to manually type the document into the computer system. As a result, OCR systems are frequently used in situations where voluminous documents must be entered into computers or databases for such purposes as archiving or analysis of documents. Classic OCR systems compare a scanned character to a character library to find a match for the scanned character. This classic approach, while effective for standard printed characters, frequently returns erroneous results when a character set that deviates even slightly from the character library is scanned. Such erroneous results require manual correction by the user, which in extreme cases can eliminate all efficiency gained from using the OCR system.
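The library comparison described above can be illustrated with a minimal sketch. It is not taken from any particular OCR product; the 3x3 glyphs, the `LIBRARY` contents, and the names `mismatches` and `match_character` are assumptions made up for illustration. It shows how a strict comparison rejects a glyph that deviates from its library entry by a single pixel.

```python
from typing import Dict, Optional, Tuple

Glyph = Tuple[str, ...]  # rows of '#' (ink) and '.' (background)

# Hypothetical 3x3 character library, invented for illustration only.
LIBRARY: Dict[str, Glyph] = {
    "I": (".#.", ".#.", ".#."),
    "L": ("#..", "#..", "###"),
    "T": ("###", ".#.", ".#."),
}


def mismatches(a: Glyph, b: Glyph) -> int:
    """Count pixel positions where two same-sized glyphs differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def match_character(
    glyph: Glyph,
    library: Dict[str, Glyph] = LIBRARY,
    max_mismatch: int = 0,
) -> Optional[str]:
    """Return the closest library label, or None if it differs too much."""
    best_label, best_score = None, None
    for label, template in library.items():
        score = mismatches(glyph, template)
        if best_score is None or score < best_score:
            best_label, best_score = label, score
    if best_score is not None and best_score <= max_mismatch:
        return best_label
    return None


# A "T" with one stray pixel of ink bleed in the bottom-right corner.
smudged_t: Glyph = ("###", ".#.", ".##")

exact = match_character(LIBRARY["T"])                   # exact library glyph
strict = match_character(smudged_t)                     # no tolerance
tolerant = match_character(smudged_t, max_mismatch=1)   # one pixel allowed
```

With exact matching the smudged glyph goes unrecognized and would need manual correction. Loosening `max_mismatch` recovers it here, but in a larger library a loose tolerance also raises the chance of matching the wrong character.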
Because it is necessary in many fields to scan documents having a variety of font styles, and in some cases having handwritten data, several new OCR systems have been created. Many systems begin by attempting to break up possibly connected characters, thus correcting some of the most common errors caused by typeset or ink bleed. Because these systems were not useful for handwritten characters and were not sufficient by themselves, new methods were developed that segment each character into multiple features, the relationship of the features being used in conjunction with a character library to find a character match. Other systems approximated a baseline, or other appropriate text lines, for each line of the document to ensure correct identification of the characters. Though these methods greatly improved the accuracy of OCR systems, all relied on some character or feature set that must be exactly matched to the scanned items. This greatly limited the usefulness of the systems.
U.S. Pat. No. 5,164,996, entitled “OPTICAL CHARACTER RECOGNITION BY DETECTING GEO FEATURES,” discloses a system that breaks each character in an input document into features, using the association of the features for character recognition. Specifically, each character is broken up into “bays” and “lagoons.” Based on the orientation of the “bays” and “lagoons” for each character, a match is made to a character library. If no match is made, the system assumes that multiple characters are represented and breaks up the character into multiple characters to attempt to find a match. This process can be repeated until a match is found. The present invention does not operate in this manner. U.S. Pat. No. 5,164,996 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,768,414, entitled “SEPARATION OF TOUCHING CHARACTERS IN OPTICAL CHARACTER RECOGNITION,” discloses a system and method for separating characters that are possibly connected. This system initially identifies all characters based on characters in a library. If characters are unidentified, a decision module attempts to separate the characters and match them to characters in the library. The separation process is repeated to attempt to identify all possibly connected characters in the input document. The present invention does not use this method to identify characters. U.S. Pat. No. 5,768,414 is hereby incorporated by reference into the specification of the present invention.
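The general idea of repeatedly separating an unidentified blob and matching the pieces to a library can be sketched as follows. This is an illustrative simplification and not the method disclosed in the cited patent; the column-based glyph encoding, the `LIBRARY` contents, and the function `recognize` are assumptions invented for the example.

```python
from typing import Dict, List, Optional, Tuple

Columns = Tuple[str, ...]  # each entry is one pixel column, top to bottom

# Hypothetical library keyed by column patterns, invented for illustration.
LIBRARY: Dict[Columns, str] = {
    ("###",): "I",
    ("###", "..#"): "L",
    ("#..", "###", "#.."): "T",
}


def recognize(
    blob: Columns, library: Dict[Columns, str] = LIBRARY
) -> Optional[List[str]]:
    """Return the characters found in blob, or None if nothing matches."""
    if blob in library:
        return [library[blob]]
    # No direct match: assume touching characters and try every cut point.
    for cut in range(1, len(blob)):
        left = library.get(blob[:cut])
        if left is None:
            continue
        rest = recognize(blob[cut:], library)  # repeat on the remainder
        if rest is not None:
            return [left] + rest
    return None


# An "L" and a "T" whose ink has run together into one connected blob.
touching = ("###", "..#", "#..", "###", "#..")
separated = recognize(touching)

# A blob with no library counterpart stays unidentified after every attempt.
unknown = recognize(("#.#", ".#."))
```

Note that the sketch still depends entirely on exact library matches: every candidate cut is tried before the blob is given up as unidentified, which is the repeated-attempt cost discussed later in this section.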
U.S. Pat. No. 5,774,582, entitled “HANDWRITTEN RECOGNIZER WITH ESTIMATION OF REFERENCE LINES,” discloses a system that estimates the location of the four principal reference lines used in writing English to assist in character recognition. After estimating each relevant baseline, the location and relevant proportions of features of a character with respect to the baselines are used to determine the characters of the input document. The features are compared to a feature library to find a “best match,” taking into account the proportion and location information previously determined. The present invention does not use this method to recognize characters of an input document. U.S. Pat. No. 5,774,582 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,940,533, entitled “METHOD FOR ANALYZING CURSIVE WRITING,” discloses a method of recognizing characters in cursive writing. The method first recognizes portions of letters, or primitives, and uses these primitives to construct allographs, which are typically letters. The allographs are matched to characters in a dictionary, each character being defined by a sequence of codes of primitives. This method of character recognition differs from the method of the present invention. U.S. Pat. No. 5,940,533 is hereby incorporated by reference into the specification of the present invention.
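A dictionary in which each character is defined by a sequence of primitive codes can be illustrated with a short sketch. The primitive names, the `DICTIONARY` entries, the greedy longest-match strategy, and the function `decode` are all assumptions chosen for illustration; the cited patent's actual codes and matching procedure are not reproduced here.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical dictionary: each character is a sequence of primitive codes.
DICTIONARY: Dict[Tuple[str, ...], str] = {
    ("loop", "down"): "a",
    ("hump", "down"): "n",
    ("hump", "hump", "down"): "m",
}


def decode(
    codes: Sequence[str], dictionary: Dict[Tuple[str, ...], str] = DICTIONARY
) -> str:
    """Greedily group primitive codes into characters; '?' marks failures."""
    longest = max(len(key) for key in dictionary)
    out: List[str] = []
    pos = 0
    while pos < len(codes):
        # Try the longest candidate sequence first, then shorter ones.
        for size in range(min(longest, len(codes) - pos), 0, -1):
            char = dictionary.get(tuple(codes[pos:pos + size]))
            if char is not None:
                out.append(char)
                pos += size
                break
        else:
            out.append("?")  # no dictionary entry starts at this primitive
            pos += 1
    return "".join(out)


word = decode(["hump", "hump", "down", "loop", "down", "hump", "down"])

# A writer who adds an unexpected stroke produces a sequence with no entry.
varied = decode(["loop", "hook", "down"])
```

The second call shows the limitation such a dictionary shares with a character library: a single unexpected primitive leaves the surrounding strokes unmatched.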
As can be seen from the prior art, optical character recognition systems rely heavily on a character library for identification of data in an input document. This works very well for printed fonts, and works in some cases for cursive script as well. However, in many cases cursive script is varied and does not necessarily fall clearly into the models in the standard library. This is especially true for handwritten documents, but is also true for inherently cursive languages, such as Arabic, Hindi and Punjabi, where there are many deviations in standard writing styles. In cases where a cursive script varies from the standard script in the library, the systems will either erroneously identify a number of characters or fail to identify several characters in the document. In most cases several processing attempts will be required before the system ultimately makes an erroneous match or determines that a match cannot be made. This results in a tremendous loss of efficiency for the systems, especially when a poor result is achieved. There is therefore a need in the art for an efficient optical character recognition system for cursive script that does not rely on a character library to identify data in an input document.