As time progresses, people become more dependent on computers to help with both work and leisure activities. However, computers operate in a digital domain that requires discrete states to be identified in order for information to be processed. Humans, by contrast, function in a distinctly analog manner in which occurrences are never completely black or white, but fall in shades of gray in between. Thus, a central distinction between digital and analog is that digital requires discrete states that are disjoint over time (e.g., distinct levels), while analog is continuous over time. As humans naturally operate in an analog fashion, computing technology has evolved to alleviate the difficulties of interfacing humans with computers (e.g., digital computing interfaces) that are caused by the aforementioned temporal distinctions.
Technology first focused on attempting to input existing typewritten or typeset information into computers. Scanners or optical imagers were used, at first, to “digitize” pictures (i.e., to input images into a computing system). Once images could be digitized into a computing system, it followed that printed or typeset material should also be able to be digitized. However, an image of a scanned page cannot be manipulated as text or symbols after it is brought into a computing system because it is not “recognized” by the system, i.e., the system does not understand the page. The characters and words are “pictures” and not actually editable text or symbols. To overcome this limitation for text, optical character recognition (OCR) technology was developed, which utilizes scanning technology to digitize text as an editable page. This technology worked reasonably well provided that the text was set in a particular font that allowed the OCR software to translate a scanned image into editable text.
Although text was “recognized” by the computing system, important additional information was lost in the process. This information included the formatting, spacing, and orientation of the text, the general page layout, and the like. Thus, if a page was double-columned with a picture in the upper right corner, an OCR-scanned page would become a grouping of text in a word processor without the double columns and the picture. Or, if the picture was included, it typically ended up embedded at some random point within the text. This is even more of a problem when different document construction standards are utilized. A typical OCR technique is generally unable to “convert” or properly recognize structure from another document standard. Instead, the recognition process attempts to confine or force the recognized parts into its own associated standard. When this occurs, an OCR process usually inserts “unknown” markers, such as question marks, into the recognized portions to indicate that it cannot process these components of the document.