In traditional optical character recognition (OCR) environments 5, FIG. 1, a hard copy document 10 becomes digitized for computing actions, such as electronic editing, searching, storing compactly, displaying on monitors, etc. It is also digitized as a precursor to other routines, such as machine translation, data extraction, text mining, invoice processing, invoice payment, and the like. As is typical, the hard copy document is any of a variety, but is commonly an invoice, bank statement, receipt, business card, written paper, book, etc. It contains both text 7 and background 9. The text typifies words, numbers, symbols, phrases, etc. having content relating to the topic of the hard copy 10. The background, on the other hand, represents the underlying media on which the content appears. The background can also include various colors, advertisements, corporate logos, watermarks, texture, creases, speckles, stray marks, and the like.
The document 10 is scanned 12, which results in a grayscale or color image 14 defined by pixels. The pixels 16-1, 16-2, 16-3 . . . are numerous, and their count depends upon the resolution of the scan, e.g., 150 dpi, 300 dpi, etc. Each pixel has an intensity value defined according to various scales, but a range of 256 possible values is common, e.g., 0-255. Upon binarization 20, the intensity values get converted from 0-255 to one of two possible binary values (black or white) in a binary image 22. Scanned pixel 16-3 in image 14 becomes binarized pixel 16-3′ in binary image 22 with a value of black or white, 1 or 0. In many schemes, binarization occurs by splitting the intensity scale of the pixels in half and labeling as black those pixels with relatively dark intensities and as white those with light intensities. At graph 25, for instance, pixels 16 of input image 14 having intensities ranging from 0-127 become labeled black during traditional binarization, while those with intensities from 128-255 become labeled white.
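By way of illustration only, the traditional half-scale binarization described above can be sketched as follows. The function name `binarize_global`, the default threshold of 128, and the sample intensities are assumptions chosen for this example and do not come from any particular OCR engine. Black is encoded as 1 and white as 0, consistent with the description of binary image 22.

```python
def binarize_global(pixels, threshold=128):
    """Label each 0-255 intensity as black (1) if it is below the
    threshold, otherwise white (0). With threshold=128, intensities
    0-127 become black and 128-255 become white."""
    return [[1 if p < threshold else 0 for p in row] for row in pixels]


# Hypothetical 2x4 grayscale patch (values are illustrative only).
image = [
    [0, 127, 128, 255],
    [64, 200, 30, 190],
]

binary = binarize_global(image)
for row in binary:
    print(row)
```

In this sketch the boundary values 127 and 128 land on opposite sides of the split, which mirrors the 0-127 and 128-255 ranges of graph 25.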
The result of binarization serves as input to OCR 30, which creates an output of content 40 available for further processing, storage, searching, text mining, displaying, etc. Text 41, XML 43, searchable PDFs 45, and the like, are common types of OCR outputs. The process also often takes place through instructions, software, etc. executing on controller(s) in hardware 50, such as imaging devices, e.g., multi-function printers (MFPs), all-in-ones (AIOs), copier machines, etc. Of course, not all OCR engines utilize binarization techniques. Instead, some directly take as input to their engine the grayscale or color image output from the scanning function.
Regardless of approach, the data submitted to OCR algorithms does not identify or help specify text characters or other content of the original document. Binarization is also susceptible to losing important information in low contrast regions of documents, especially where light text resides on light backgrounds or dark text resides on dark backgrounds. If a dark text pixel, say an intensity of 50 out of 255 (graph 25), resides near dark background pixels, say an intensity of 75 out of 255, application of a global threshold of, e.g., 50% results in both text and background pixels being characterized as black after binarization. Because the text and background are similar in color, all text information is lost after binarization. As seen in FIG. 2, a group of pixels 60 from document 10′ has both dark text pixels 7′ (from a portion of the letter “N”) and dark background pixels 9′. If a global threshold of 50% is used to binarize this image, all pixels are homogeneously declared the same binary value, black 60′, as they all fall below the threshold. The informational distinction between the original text 7′ and background 9′ is lost. Similarly, informational content is lost when light text pixels or dot-matrix style pixels reside on light colored backgrounds. Traditional binarization techniques simply do not allow dark and light text to be discerned clearly enough when positioned on or near dark and light background regions within the same image.
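The loss of content can be made concrete with a small numeric sketch. The intensities 50 (text) and 75 (background) and the 50% global threshold are taken from the example above; the 4x4 pixel layout, which loosely suggests a fragment of a letter, and all names in the code are hypothetical and serve only to illustrate the effect.

```python
def binarize_global(pixels, threshold=128):
    """Global threshold: black (1) below the threshold, white (0) otherwise."""
    return [[1 if p < threshold else 0 for p in row] for row in pixels]


TEXT = 50        # dark text intensity from the example above
BACKGROUND = 75  # dark background intensity from the example above

# Hypothetical 4x4 patch of dark text on a dark background.
patch = [
    [TEXT, BACKGROUND, BACKGROUND, TEXT],
    [TEXT, TEXT,       BACKGROUND, TEXT],
    [TEXT, BACKGROUND, TEXT,       TEXT],
    [TEXT, BACKGROUND, BACKGROUND, TEXT],
]

binary = binarize_global(patch)  # 50% of the 0-255 scale

# Two distinct intensities go in; only one binary value comes out.
distinct_before = {p for row in patch for p in row}
distinct_after = {b for row in binary for b in row}

print("distinct intensities before:", sorted(distinct_before))
print("distinct values after:", sorted(distinct_after))
```

Because both 50 and 75 fall below 128, every pixel in the patch is labeled black, and the text can no longer be separated from the background by any later processing step.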
What is needed is better binarization. What is further needed is better discernment of light and dark text from light and dark backgrounds to avoid loss of content. Further needs also contemplate instructions or software executable on controller(s) in hardware, such as imaging devices. Additional benefits and alternatives are sought when devising solutions.