1. Field of the Invention
This invention relates generally to an apparatus and method for implementing weighted voting schemes for multiple optical character recognition processors. More particularly, this invention relates to implementing a probabilistic approach to determining the actual characters read when multiple optical character recognition processors scan identical images.
2. Discussion
It is often both useful and necessary to transform data appearing in a hard-copy format into a format readable by and useful to computers. One common example is text appearing on printed or published pages. A second example, increasingly common as electronic communications become more popular, involves telefacsimile transmissions executed from computer to computer with no paper generation. A computer hard disk or other storage medium stores the received fax image, and a hard-copy printout is strictly optional. Though popular, telefacsimile transmissions represent data as images composed of dots or pixels, not as a string of coded characters, such as American Standard Code for Information Interchange (ASCII) characters, useful to word processors and the like.
In order to transform images stored in electronic or hard-copy formats to ASCII or other coded character formats for use by word processing, database, or other coded character-based utilities, optical character recognition (OCR) processors translate electronic character images to a preferred character code format. An OCR processor scans the image to read the image characters, detects similarities to member characters of a known, predetermined character code, and outputs a reported character for each image character read. The reported character is a prediction of what the image character actually is. Preferably, the image (or actual) characters are read and reported in the same sequence in which they appear on the scanned image, so that the reported characters, when assembled, replicate the actual characters appearing in the scanned image. In the simplest optical character recognition systems, a single OCR processor or device reads the image and outputs a reported character or characters. However, due to various possible imperfections in the scanned image or in the OCR processor itself, the OCR processor may inaccurately report the characters read, introducing errors into the output coded text.
In order to increase the correlation between the read image characters and the reported characters, multiple OCR processors having various character recognition strengths and weaknesses may simultaneously or sequentially scan an identical image, and each OCR processor may in turn output a stream of reported characters corresponding to the actual characters read from the image. A post-read processor then aligns the reported character streams so that for each actual character scanned, a set of reported characters (one for each of the OCR processors) represents candidates for selection as representing the actual character read. Prior multiple OCR systems employed one of a variety of techniques to determine which character is most likely the actual character read. One technique relies on a majority-rule approach where the character most frequently reported by the OCR processors is output as the actual character read. Alternatively, because different OCR processors have different reporting accuracies, decisional algorithms weight the character reported by a particular OCR processor in accordance with an overall accuracy predetermined for each OCR processor. The weights for each reported character are then accumulated, and the reported character producing the greatest accumulated weight determines the character output by the multiple OCR system.
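The two prior techniques described above can be illustrated with a minimal Python sketch. The function names, the example candidate set, and the overall accuracy figures are hypothetical and chosen only for illustration; they are not taken from any particular prior system. The sketch assumes the reported character streams have already been aligned, so that each call receives one candidate per OCR processor for a single actual character.

```python
from collections import Counter, defaultdict


def majority_vote(candidates):
    """Majority rule: return the character reported most often.

    `candidates` holds one reported character per OCR processor.
    Ties go to the tied character that appears first in the list.
    """
    counts = Counter(candidates)
    return max(candidates, key=lambda ch: counts[ch])


def weighted_vote(candidates, overall_weights):
    """Overall-accuracy weighting: each processor contributes one fixed
    weight, regardless of which character it reported.

    `overall_weights[i]` is the (assumed) overall accuracy of processor i.
    """
    totals = defaultdict(float)
    for ch, weight in zip(candidates, overall_weights):
        totals[ch] += weight  # accumulate weight per reported character
    return max(totals, key=totals.get)


# Three hypothetical processors report candidates for one actual character.
reported = ["c", "e", "e"]
overall = [0.95, 0.40, 0.45]  # illustrative overall accuracies

print(majority_vote(reported))           # two of three processors agree
print(weighted_vote(reported, overall))  # the single high-weight processor wins
```

With these illustrative figures the two techniques disagree: majority rule selects "e", while the accumulated weights favor "c" (0.95 against 0.85), which shows how strongly the result depends on the single overall rating assigned to each processor.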
The above-mentioned systems do not consider that the accuracy of a particular OCR processor may depend upon the particular character read and reported by that OCR processor. In other words, it may be impractical and inaccurate to weight an OCR processor based on an overall rating. Thus, it is desirable to have a multiple OCR processor system which weights each OCR processor according to the particular reported character.
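As a rough illustration of the distinction drawn here, the following Python sketch weights each OCR processor by the particular character it reported rather than by a single overall rating. It is only a sketch under stated assumptions and is not the method of the invention: the per-character weight tables, the `default` fallback weight, and the function name are all hypothetical.

```python
from collections import defaultdict


def per_character_vote(candidates, char_weights, default=0.5):
    """Weight each processor according to the character it reported.

    `candidates[i]` is the character reported by processor i.
    `char_weights[i]` maps a reported character to an assumed weight for
    processor i when it reports that character; `default` is a
    hypothetical fallback for characters missing from the table.
    """
    totals = defaultdict(float)
    for processor, ch in enumerate(candidates):
        totals[ch] += char_weights[processor].get(ch, default)
    return max(totals, key=totals.get)


# Hypothetical tables: processor 0 is reliable when it reports "e" but
# much less so when it reports "c".
char_weights = [
    {"c": 0.60, "e": 0.99},
    {"c": 0.98, "e": 0.55},
    {"e": 0.50},
]

print(per_character_vote(["c", "e", "e"], char_weights))
```

Here processor 0 contributes only 0.60 for its report of "c", so the two reports of "e" (0.55 + 0.50) prevail, even though a single overall rating might have let processor 0 dominate.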