1. Field of the Invention
The present invention is directed to character recognition and, more particularly, to improved statistical-based recognition of similar looking characters.
2. Description of Related Art
Character recognition is used in many ways. Two examples of character recognition applications include optical character recognition (OCR) and pen-based computing.
In OCR, forms containing printed and handwritten characters (such as alphanumeric characters, kana characters, kanji characters, or other types of characters) may be scanned by an optical character recognition (OCR) device. An OCR device typically shines a bright light onto a form. The light reflected back indicates where printed information is located. This printed information is then interpreted as recognizable characters, words, etc. The interpreted information becomes electronic text which may be displayed on a computer monitor, stored in an electronic memory, or otherwise processed as ordinary electronic text.
FIG. 1 is a block diagram of a typical OCR system 50, such as may be used with the present invention. This system 50 includes a paper transport system 51. The paper transport system 51 moves the forms in the direction of the arrow past an optical scanner ("OCR scanner") 52. A preferred embodiment of the OCR scanner 52 forms a digital image of a form by illuminating the form with a bright light such as a laser light and then recording the reflected light using storage devices such as CCDs. This type of scanner may be used to form a bitonal digital image wherein each pixel is either white or black, corresponding to logic "1" or logic "0". One such OCR scanner 52 is a model TDC2610W, manufactured by Terminal Data Corp.
The scanner 52 may be connected to a processor 54, such as a general purpose computer or a special purpose hardware processing element. The processor's hardware elements may be optical processing elements or electronic processing elements such as resistor summing networks and digital logic circuits. The processor may include a microprocessor 56 and other components; a screen or monitor 58; and a keyboard or other input device 60. The processor 54 may also include a memory device 62 for storing the scanned documents. This memory device may be, for example, a disk memory, a RAM, or other memory device. The character recognition is typically performed by the processor 54.
Pen-based computing may be used, for example, in Personal Digital Assistants (PDAs) and other pen-based computing devices. FIG. 2 illustrates a pen-based PDA 70. A PDA is a portable computer which may include applications for appointment calendars, an electronic address book, a pen-based memo pad, and may also provide wireless communications, such as faxing, e-mail, and paging. These devices may receive information handwritten on a screen 72 using a stylus 74 or other pen-like device. Known pen-based computers may, for example, include word-processing devices using a stylus for input rather than (or in addition to) a keyboard. The handwritten input 76 may be interpreted using character recognition techniques and converted into electronic form. The converted text may be displayed on the screen 72 (as seen in FIG. 2), stored in memory, transmitted as a digital message, such as a page, fax, or e-mail, or otherwise handled as any other electronic text. The character recognition is typically performed in a processor in the PDA 70.
One problem involved in character recognition is distinguishing between two or more characters that are similar in appearance. Although this problem arises with all character types, it is particularly acute in oriental languages. This is because oriental languages contain a large number of characters having complex structures, many of which are very similar in appearance. For example, the Republic of China's Department of Education has identified 5,401 commonly used Chinese characters, and some business and industrial applications may commonly use more than 13,000 characters. In contrast, an alphanumeric keyboard typically contains 68 characters including letters, numbers, punctuation marks, and mathematical characters. An additional 26 characters include the upper case letters, for a total of 94 characters. A Chinese character comprises a number of radicals; a radical comprises a number of strokes. Therefore, a Chinese character typically comprises many strokes. This structure is typically much more complex than that of alphanumeric characters and numerals.
Character recognition usually involves comparing an image to be interpreted with data relating to known characters. From this comparison, a number of candidate characters are selected from the data which most closely resemble the image to be interpreted. A number of characters are ranked according to the comparison. The present inventors tested a character recognition system proprietary to the assignee herein using Chinese characters and found that the correct recognition percentage differed the most when determining between the top two candidate characters.
When the top five candidates were requested, the system included the correct character 97.81% of the time. When the top two candidates were requested, the correct character was included 95.16% of the time. When the single top candidate was requested, however, the correct character was presented only 90.02% of the time. A chart comparing N (where N is the number of candidates listed) with the percentage of correctly recognized characters is set out below:
______________________________________
Order of                Pct. Correct
First N Candidates      Recognition
______________________________________
N = 5                   97.81%
N = 4                   97.41%
N = 3                   96.72%
N = 2                   95.16%
N = 1                   90.02%
______________________________________
Note the drastic drop in recognition rate from N=2 (95.16%) to N=1 (90.02%).
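The size of that drop relative to the others can be checked with simple arithmetic on the figures in the chart. The following is a minimal sketch; the names `top_n_rate` and `marginal_gains` are illustrative and do not come from the tested system.

```python
# Top-N recognition rates (percent), copied from the chart above.
top_n_rate = {1: 90.02, 2: 95.16, 3: 96.72, 4: 97.41, 5: 97.81}

def marginal_gains(rates):
    """Return {N: gain}, the increase in rate when going from N-1 to N candidates."""
    gains = {}
    for n in sorted(rates):
        if n - 1 in rates:
            gains[n] = round(rates[n] - rates[n - 1], 2)
    return gains

gains = marginal_gains(top_n_rate)
for n, gain in gains.items():
    print(f"N={n - 1} -> N={n}: +{gain:.2f} percentage points")

# The largest single step is between one candidate and two candidates.
largest_step = max(gains, key=gains.get)
```

Allowing a second candidate adds 5.14 percentage points, while each further candidate adds 1.56 points or less, which is why the choice between the top two candidates is singled out.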
This test illustrates that one problem with character recognition is deciding between the top two candidate characters. One reason for this problem is that certain characters look very similar. As seen in FIG. 3, many Chinese characters appear very similar. FIG. 3 illustrates the Chinese characters for "self" 80, "already" 82, and "stop" 84. Other oriental languages, such as Japanese and Korean, also have very large character vocabularies containing similar-looking characters.
Incorrect character recognition may have several adverse effects on a character recognition system. If the interpreted character is not further processed after interpretation, the output will be incorrect. If the character is further processed after interpretation, the error may substantially slow the recognition process or still result in an incorrect output. If a character is "word" or "grammar" checked, the character is checked to determine if it makes sense in the context of a full word, or a sentence or phrase, respectively. Using an alphanumeric example, an error in which a character that was really an "e" is interpreted as a "c" may be detected when the complete "word" is "thc", rather than "the". An error in which a character that was really an "E" is interpreted as a "C" may be detected in a grammar check when the resulting sentence is "Cat your dinner", instead of "Eat your dinner".
If a word check rejects the word, each character in the rejected word may be reviewed to determine which character was the likely cause of the error. When a likely cause of the error is determined, another candidate character is inserted and the word check is performed again. If a grammar check rejects a sentence, each word may be reviewed to determine the likely cause of the error. Once the probable erroneous word is located, each character may be reviewed to determine which character was the likely cause of the error in the word. When a likely cause of the error is determined, another candidate character is inserted and the word and grammar checks are performed again. Naturally, if the word having the next candidate character fails one of these checks, the process begins again. It is easy to see how incorrect character recognition may lead to significant delays in the character recognition process.
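The word-check retry described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular system: the function `correct_word`, the toy lexicon, and the strategy of trying lower-ranked candidates one position at a time are all hypothetical. The counter shows how each rejected word adds further lookups.

```python
def correct_word(candidates, lexicon):
    """Try to repair a word that fails a word check.

    candidates: one ranked list of single-character candidates per position
                (best candidate first). Hypothetical input format.
    lexicon:    set of valid words (a stand-in for a real word check).
    Returns (word_or_None, number_of_word_checks_performed).
    """
    checks = 1
    word = "".join(ranked[0] for ranked in candidates)
    if word in lexicon:
        return word, checks
    # Review each character position and substitute the next candidates.
    for pos, ranked in enumerate(candidates):
        for alternative in ranked[1:]:
            trial = word[:pos] + alternative + word[pos + 1:]
            checks += 1
            if trial in lexicon:
                return trial, checks
    return None, checks

# The "thc" / "the" example from the text: the third character was
# ranked "c" first and "e" second.
result = correct_word([["t"], ["h"], ["c", "e"]], {"the", "cat", "eat"})
print(result)  # ('the', 2)
```

Here a single wrong top candidate doubles the number of word checks; with longer words and more candidates per position the extra work grows accordingly.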
Several solutions to the problem of similar looking characters have been proposed. Many of these proposed solutions, however, are based on structural matching character recognition methods. Structural matching methods match features of the image to be interpreted with likely candidate characters. That is, these methods attempt to distinguish characters by the presence or absence in the image of structural features such as strokes, dots, and radicals. Such structural matching solutions are disclosed, for example, by Jeong-Seon Park and Seong-Whan Lee "Adaptive Nonlinear Pattern Matching Method for Off-Line Recognition of Handwritten Characters," IWFHR (International Workshops On Frontiers in Handwriting Recognition) IV, pp. 165-75, 1994; and A. B. Wang, J. S. Huang, and K. C. Fan, "Optical Recognition of Handwritten Chinese Characters by Modified Relaxation", Proceeding of 1992 Second National Workshop on Character Recognition, pp. 115-58.
Although structural matching may, in some instances, provide precise discrimination between similar characters, it has several drawbacks. Two of these drawbacks are limitations in character recognition hardware and the difficulty in generalizing structural differences between characters, particularly in oriental languages.
It is difficult for character recognition hardware to detect, or "extract", different features of images. The structure of characters, particularly Chinese and other oriental characters, is too complicated to be readily distinguished in this manner. Also, an optically scanned image may be distorted, damaged, or not fully detected (for example, a thin or lightly drawn line may not be detected) so that the entire structure of the character's image is not available for extraction and matching. This problem is exacerbated with handwritten characters because of the great variation in handwriting. Again using an alphanumeric example, not everyone dots his "i" or "j"; also, left-handed people may write characters in a different manner than right-handed people. Variations in handwriting may result in a lower correct recognition rate.
It is also difficult to generalize the structural differences between all sets of similar characters (one set of similar characters is shown in FIG. 3). Each set of similar-looking characters requires different structural features in order for the characters to be distinguished from one another. It may require substantial work and time to manually generalize the structural differences between similar-looking characters in order to establish a structural matching database.
In contrast to structural matching, some character recognition methods are statistics-based. In a statistical character recognition system, an image is partitioned into a number of blocks, and a number of pixel-based features in each block are extracted. The extracted pixel-based features in each block are compared to previously obtained statistical information regarding possible characters. Candidate characters are selected according to how closely the number of features in each block statistically resembles the previously obtained information. It is simpler to "extract" scanned images using a statistical character recognition method. This method also lessens the errors caused by variations in handwriting. One such statistical character recognition system is described in T. F. Li, S. S. Yu, H. F. Sun, and S. L. Chou, "Handwritten and Printed Chinese Character Recognition Using Bayes Rule", Intelligent Systems for Processing Oriental Languages, pp. 406-11, 1992.
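The block-partitioning idea can be illustrated with a small sketch. It makes several assumptions that are not taken from the cited system: the only pixel-based feature is a count of ink pixels per block, a pixel value of 1 is taken to mark ink, the stored statistics are per-character mean counts, and closeness is measured by squared distance rather than by Bayes rule. The names `block_features`, `rank_candidates`, and the labels "A", "B", "C" are hypothetical.

```python
def block_features(image, block_rows, block_cols):
    """Count ink pixels (value 1) in each block of a bitonal image.

    image is a list of equal-length rows of 0/1 values whose height and
    width are assumed to be divisible by block_rows and block_cols.
    """
    block_h = len(image) // block_rows
    block_w = len(image[0]) // block_cols
    features = []
    for br in range(block_rows):
        for bc in range(block_cols):
            count = sum(
                image[r][c]
                for r in range(br * block_h, (br + 1) * block_h)
                for c in range(bc * block_w, (bc + 1) * block_w)
            )
            features.append(count)
    return features

def rank_candidates(features, reference, top_n=2):
    """Rank characters by squared distance between features and stored means."""
    def distance(label):
        return sum((f - m) ** 2 for f, m in zip(features, reference[label]))
    return sorted(reference, key=distance)[:top_n]

image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
# Hypothetical reference statistics: mean ink count per block per character.
reference = {"A": [4, 0, 0, 2], "B": [4, 0, 0, 0], "C": [0, 4, 4, 0]}

features = block_features(image, 2, 2)
print(features)                              # [4, 0, 0, 2]
print(rank_candidates(features, reference))  # ['A', 'B']
```

Because the features are simple counts, nothing in the extraction step depends on locating strokes or radicals, which is the sense in which statistical extraction is simpler than structural extraction.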
The problem of distinguishing similar looking characters has also been addressed using statistical methods. Most of these proposed solutions define special features of the similar-looking characters. These special features do not apply to all sets of similar-looking characters, and therefore have only limited use. It has been proposed in J. Z. Hu, "Identification of Similar Characters in Handwriting and Printed Chinese Character Recognition", Chinese Information Journal, Issue 1, Vol. 9, pp. 37-41, 1995, to define different statistics for features of different characters. However, as with the structural methods, it may require substantial work and time to manually define the statistical data regarding differences between similar-looking characters in order to establish a statistical database. Moreover, the additional statistical data may use a significant amount of memory, making such a device unsuitable for applications where large memories are not available or are impractical. A PDA, for example, should be portable and reasonably priced, and may therefore have memory constraints; thus efficient memory use may be advantageous.
FIG. 4 is a block diagram illustrating a statistical character recognition device 90 having a separately compiled database of special features of similar looking characters. This device 90 may be found, for example, in the processor 54 of an OCR 50 or PDA 70. A device operating according to this method uses standard hardware well-known to those skilled in the relevant art.
An image of a character is input into the character recognition device. The image may be an image of a printed or handwritten document scanned by an OCR (as seen in FIG. 1), an input on a screen of a pen-based computer (as seen in FIG. 2), or other image input. The image is presented to a feature extractor 92, where features of the image are "extracted". The image is scanned to determine the location of pixels.
Next, the extracted image is sent to a recognition engine 94 for recognition. The recognition engine 94 compares the extracted features of the image to statistical information regarding a set of characters contained in an original feature reference database 96. One or more candidate characters are selected based on this comparison.
A new feature definition database 100 stores statistical information regarding similar-looking characters. The database 100 is consulted to determine if a candidate character is one of the characters that is similar in appearance to other characters. If it is, the image is extracted by a similar character feature extractor 98, which looks at the scanned pixels in more detail.
Image features extracted by the similar character extractor 98 are presented to a similar character recognition module 102, which selects a top candidate character by referring to a new feature reference database 104. The new feature reference database 104 contains statistical information regarding differences between features of similar-looking characters. The top character candidate is output by the device for further processing by OCR 50 or PDA 70, such as word or grammar checking, or is stored, displayed, or otherwise used in any manner that other electronic data is stored, handled, or processed. Note that in addition to the ordinary character recognition device, three additional databases (original feature definition database 96, new feature definition database 100, and new feature reference database 104) are used to distinguish similar-looking characters. These databases use up valuable memory space.
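The two-stage flow of FIG. 4 can be summarized in a short sketch. Everything here is a simplification under assumptions: feature extraction is not modeled (the two feature vectors are passed in directly), squared distance stands in for the statistical comparison, and the names `recognize`, `similar_sets`, `new_ref`, and the placeholder character "other" are hypothetical. The point of the sketch is that the second stage needs its own tables in addition to the original reference data.

```python
def sq_dist(a, b):
    """Squared distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def recognize(coarse, fine, original_ref, similar_sets, new_ref):
    """Two-stage recognition in the style of FIG. 4 (illustrative only).

    coarse: features from the ordinary feature extractor
    fine:   features from the similar character feature extractor
    """
    # Stage 1: ordinary recognition against the original reference data.
    top = min(original_ref, key=lambda ch: sq_dist(coarse, original_ref[ch]))
    # Consult the extra table: is the top candidate in a similar-looking set?
    group = similar_sets.get(top)
    if group is None:
        return top
    # Stage 2: decide within the set using the additional reference data.
    return min(group, key=lambda ch: sq_dist(fine, new_ref[ch]))

# Toy data; "self" and "already" are the similar-looking pair from FIG. 3.
original_ref = {"self": [3, 1], "already": [3, 2], "other": [9, 9]}
similar_sets = {"self": ("self", "already"), "already": ("self", "already")}
new_ref = {"self": [0], "already": [5]}

first = recognize([3, 1.4], [4], original_ref, similar_sets, new_ref)
second = recognize([9, 8], [0], original_ref, similar_sets, new_ref)
print(first, second)  # already other
```

In the first call the ordinary stage prefers "self", and only the additional tables (`similar_sets` and `new_ref`) change the answer to "already"; those tables are the extra memory the paragraph above refers to.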
It should be understood that the present invention may be used with any type of characters (alphanumeric, Cyrillic, Hebrew, Arabic, Greek, etc.). The invention, however, is particularly suitable for use with the unique challenges imposed by oriental languages. The invention, therefore, is described with reference to Chinese characters.
It is an object of the present invention to improve character recognition rates by improving the discrimination between similar-looking characters.
It is another object of the present invention to provide a memory efficient and economical device for increasing the correct character recognition rate.
It is yet another object of the present invention to provide a statistical method for improving character recognition that relies on reference data already used by a character recognition device, thus not using any significant additional memory space.