Much of the work that we do today is about information. We do research to gather information, we experiment to create information, and we communicate to share information. Information therefore comes in many forms. Unfortunately, information held in one form is not always accessible from another. Information on paper is often seen as portable and easy to work with. Computers, on the other hand, can store much larger amounts of information and, in many cases, search it faster than a person working with paper.
Many computers, however, cannot access the vast quantities of printed information. One solution is to have a paperless workplace, in which all information is electronic and therefore always accessible to computers. Paper is still a very useful way for people to store and communicate data, though, and as such it thrives in modern workplaces. Another approach is Optical Character Recognition (OCR), in which computers are programmed to read from a paper document.
U.S. Pat. No. 5,684,891 exemplifies this approach. OCR systems often image or scan a printed document to create an electronic image of the document. Many systems start by separating the image into meaningful parts, such as blocks of text, words, or characters. Some systems, such as that of U.S. Pat. No. 5,212,739, are able to process a whole word using specialized techniques (e.g., micro-features of the represented characters). Most OCR systems, however, examine each character individually. Techniques such as matrix matching and feature extraction are used to create a representation of the character from the image. These techniques generally attempt to determine which pixels are unnecessary, or to summarize representative pixels using concepts such as a vector or a center of mass. Often, these representations are compared to known templates, as in matrix matching. In this way, a computer is effectively able to ask “Does this look more like an ‘a’ or a ‘b’?” Sometimes, these representations are instead characterized to determine a most likely match, e.g., “Which character is thin with a dot on top?”
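As a rough illustration of the template comparison and center-of-mass ideas described above, the following Python sketch scores a small binary glyph against a set of templates and picks the closest one. It is only a minimal sketch under stated assumptions and is not the method of any cited patent: the 3x3 templates, the pixel-agreement score, and the names `TEMPLATES`, `pixel_agreement`, `center_of_mass`, and `match_character` are all hypothetical and chosen for illustration.

```python
# Hypothetical 3x3 binary templates (1 = ink, 0 = background).
# Real systems would use much larger glyph bitmaps and many more characters.
TEMPLATES = {
    "I": ((0, 1, 0), (0, 1, 0), (0, 1, 0)),
    "L": ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
    "T": ((1, 1, 1), (0, 1, 0), (0, 1, 0)),
}


def pixel_agreement(glyph, template):
    """Return the fraction of pixels on which glyph and template agree."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            same += int(g == t)
    return same / total


def center_of_mass(glyph):
    """Return the (row, col) centroid of the ink pixels, or None if blank."""
    ink = [(r, c) for r, row in enumerate(glyph) for c, v in enumerate(row) if v]
    if not ink:
        return None
    n = len(ink)
    return (sum(r for r, _ in ink) / n, sum(c for _, c in ink) / n)


def match_character(glyph, templates=TEMPLATES):
    """Matrix matching: return the best-scoring template label and all scores."""
    scores = {label: pixel_agreement(glyph, t) for label, t in templates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# A slightly damaged 'L' (one ink pixel missing from the bottom row).
noisy_l = ((1, 0, 0), (1, 0, 0), (1, 1, 0))
best_label, all_scores = match_character(noisy_l)
```

In this sketch the question “Does this look more like an ‘a’ or a ‘b’?” becomes a comparison of agreement scores, while `center_of_mass` shows one way a glyph might be reduced to a compact feature rather than compared pixel by pixel.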
Recently, some techniques, such as that of U.S. Pat. No. 6,658,151, have treated recognition as a substitution cipher. These methods are often used for document matching or language identification rather than full transcription. This may be because their results are probabilistic estimates and not definitive results. Such a technique may associate each identifier with a character one at a time.
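To make the substitution-cipher idea concrete, the sketch below assumes that each character image has already been assigned an arbitrary identifier (for example, a shape-cluster number) and then guesses a character for each identifier, one at a time, by frequency rank. This is an illustrative assumption only, not the method of the cited patent: the frequency ordering is an approximate, commonly quoted ordering for English, and the names `ENGLISH_BY_FREQUENCY`, `guess_mapping`, and `decode` are hypothetical.

```python
from collections import Counter

# Assumed approximate ordering of English letters, most to least frequent.
ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"


def guess_mapping(identifiers, alphabet_by_frequency=ENGLISH_BY_FREQUENCY):
    """Pair each identifier with a character by frequency rank, one at a time.

    The most common identifier is paired with the most common letter, and so
    on. Identifiers beyond the length of the alphabet are left unmapped.
    """
    counts = Counter(identifiers)
    # Sort by descending count; break ties by identifier so output is stable.
    ranked = sorted(counts, key=lambda ident: (-counts[ident], ident))
    return dict(zip(ranked, alphabet_by_frequency))


def decode(identifiers, mapping, unknown="?"):
    """Render a sequence of identifiers as text using the guessed mapping."""
    return "".join(mapping.get(ident, unknown) for ident in identifiers)


# Hypothetical identifiers produced by grouping similar-looking glyphs.
sample = [7, 7, 7, 3, 3, 9]
mapping = guess_mapping(sample)
text = decode(sample, mapping)
```

Because the mapping rests only on frequency statistics, it is an estimate that can easily be wrong on short documents, which is consistent with such results being probabilistic rather than definitive.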
Some of these approaches require a large amount of image processing, which is a compute-intensive task. Some are unable to identify characters in a font that is very different from the fonts that the system has been programmed to interpret. Many are unable to learn about a document as they scan it; e.g., if a scanner does not recognize an ‘e’ the first time, it will not be able to recognize any other ‘e’ in the document.
Thus, there is a need for an approach to character recognition that is more computationally efficient and that is able to make use of information beyond pre-programmed representations of characters. There is also a need for character recognition that can determine multiple characters at a time or more accurately associate identifiers with characters.