1. Field of the Invention
This invention relates generally to the optical recognition of text. In particular, the invention provides a method for identification of characters having diacritic marks.
2. Description of the Prior Art
Applications that use optical recognition to extract data from a document must first create an electronic copy of the document in one of a plurality of standard image formats using a scanner, facsimile machine, digital camera or other similar digitization device. Using image-processing algorithms, text characters are then isolated so that each may be individually recognized. In forms processing, isolation can be achieved using constrained print fields. Here, form fields are provided with certain attributes that segment the field into individually spaced regions. Boxed or combed representations suggest to the person filling out the form that characters should be printed or written in these spaced regions. If form fields do not use constrained print fields, an automatic segmentation process is typically used to isolate the individual characters prior to recognition. The segmentation process uses various geometric parameters, such as line spacing, font size, average character spacing, and average character width, to box each character into a segmented region. Whichever method is used to isolate the characters, the images of the isolated regions are digitized into the form of a character bitmap, i.e., a rectangular matrix of pixels. Proprietary recognition algorithms analyze the character bitmaps to determine their computer-defined identity (code). With the identification, a computer system can output text corresponding to a character bitmap to an output medium.
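By way of illustration only, the isolation of characters from an unconstrained field may be sketched as follows. The sketch assumes a binary bitmap (1 for ink, 0 for background) and substitutes a simple column-projection test for the geometric parameters named above; the names `segment_columns`, `crop`, and `min_gap` are hypothetical and do not describe any particular prior-art product.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # rows of pixels; 1 = ink, 0 = background (assumed encoding)


def segment_columns(field: Bitmap, min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, one per isolated character.

    A character ends where at least `min_gap` consecutive blank columns follow it.
    `min_gap` is a hypothetical stand-in for average character spacing.
    """
    if not field:
        return []
    width = len(field[0])
    # A column "has ink" if any row has a set pixel in that column.
    ink = [any(row[x] for row in field) for x in range(width)]

    spans: List[Tuple[int, int]] = []
    start = None      # first column of the character currently being traced
    blank_run = 0     # blank columns seen since the last ink column
    for x, has_ink in enumerate(ink):
        if has_ink:
            if start is None:
                start = x
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                spans.append((start, x - blank_run + 1))
                start = None
                blank_run = 0
    if start is not None:
        # The field ended before a full gap was seen; close the last character.
        spans.append((start, width - blank_run))
    return spans


def crop(field: Bitmap, span: Tuple[int, int]) -> Bitmap:
    """Cut one character bitmap (a rectangular matrix of pixels) out of the field."""
    a, b = span
    return [row[a:b] for row in field]


# Two small glyphs separated by one blank column.
field = [
    [0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 0],
]
spans = segment_columns(field)
character_bitmaps = [crop(field, s) for s in spans]
```

Each entry of `character_bitmaps` corresponds to one character bitmap that would then be passed to a recognition algorithm. A practical segmenter would also have to handle touching or overlapping characters, which this sketch does not attempt.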
The proprietary recognition algorithms used in the prior art use all of the character bitmap as input in making their determination (although some of the pixels may be removed through pre-recognition filtering mechanisms). Regardless of whether the character consists of a body portion (Base) only, or a Base with diacritics (marks used for providing phonetic information or distinguishing a Base), the algorithm processes all of the information in one instance.
In many non-English languages, diacritics are used with many of the alphabetic characters. The classifier, a module that limits the choices of the output to a certain specified character set, must process characters both with and without diacritics. The greater the number of characters in the set, the greater the potential for recognition error and the slower the application processing speed. For instance, the French alphabet has 78 uppercase and lowercase letters, so the recognition classifier has to decide between 78 different characters. Of these 78 characters, many are identical in Base but possess different diacritics. When the diacritics are different but visually similar, selection of the correct character becomes much more difficult. This is especially true since most writers pay little attention to the quality and exactitude of the diacritic mark when putting pen to paper.
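The observation that many characters share a Base and differ only in their diacritics may be illustrated with the following sketch. It relies on Unicode canonical decomposition (NFD) from the Python standard library as a convenient stand-in for the Base and diacritic distinction; the function names and the sample character set are hypothetical, and the sample is a small illustrative subset rather than the full set of French letters.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List, Tuple


def split_base_and_marks(ch: str) -> Tuple[str, Tuple[str, ...]]:
    """Split one character into its Base and the names of its diacritic marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    # Combining characters have a non-zero canonical combining class.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = tuple(unicodedata.name(c) for c in decomposed if unicodedata.combining(c))
    return base, marks


def group_by_base(charset: str) -> Dict[str, List[str]]:
    """Group the characters of a classifier's character set by shared Base."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ch in charset:
        base, _ = split_base_and_marks(ch)
        groups[base].append(ch)
    return dict(groups)


# Illustrative subset only: several lowercase letters with and without diacritics.
sample = "aàâcçeéèêë"
groups = group_by_base(sample)

print(len(sample), "characters share", len(groups), "distinct Bases")
for base, members in groups.items():
    print(base, "->", " ".join(members))
```

In this subset, ten characters reduce to three distinct Bases, so a classifier that treats every combination as a separate class must discriminate among candidates that differ only in a small mark.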