All conventional optical character recognition (OCR) systems and methods are limited in the number of characters they can recognize. Conventional OCR systems have been built to recognize upper and lower case alphabetical characters, digits, and punctuation marks. Conventional character recognition systems and methods are unable to recognize diacritical markers, such as ä (umlaut), â (circumflex), ã (tilde), ā (macron), ȧ (dot above), and a⃗ (vector). Diacritical markers are used frequently in dictionaries, technical documents, and scholarly scientific journals and publications.
Conventional OCR systems cannot recognize diacritical markers because it is difficult to do so without negatively impacting the recognition rate of the characters A-Z and a-z, the digits 0-9, and punctuation marks such as ",", ".", and "?". In other words, there is a limit on the number of characters, digits, and punctuation marks that an OCR system can recognize. Once the OCR system passes this threshold, it has a harder time discriminating between the characters, digits, and punctuation marks and any diacritical marker. Therefore, the speed and efficiency of recognizing any of the common characters, digits, and punctuation marks substantially decrease.
Accordingly, there exists a significant need for a method that can recognize diacritical markers without negatively impacting the recognition rate of regular characters, digits, and punctuation marks.