Conventionally, optical character recognition (OCR) has been used to convert the content of a document from one format to another. Generally, OCR refers to machine recognition of printed alphanumeric characters. Although OCR systems can recognize many different fonts, including typewritten and computer-printed characters, a given system is often limited to a particular set of fonts. Advanced OCR systems that can recognize hand printing are being developed. Unfortunately, OCR systems today provide only limited capability to detect the structural characteristics (e.g., layout) of a document, leaving the user with the sometimes overwhelming task of reformatting the output in order to replicate the original document.
In a typical scanning operation, a bitmap is created by electronically scanning a text document. The bitmap is a binary representation in which a bit or set of bits corresponds to some part of an object such as an image or font. By way of example, in monochrome systems, one bit represents one pixel on screen. For gray scale or color, several bits in the bitmap represent one pixel or group of pixels. Although a bitmap is most often associated with graphics objects, in which the bits are a direct representation of the picture image, bitmaps can be used to represent any portion of a document. In doing so, each bit location is assigned a value that represents the condition of the corresponding part of the document.
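By way of illustration only, the relationship between pixels and bits described above can be sketched as follows. The function names `pack_monochrome` and `bits_per_pixel` are hypothetical and are not drawn from any particular scanner or OCR product. The sketch assumes pixels are supplied as rows of 0/1 values and are packed most significant bit first.

```python
def pack_monochrome(pixels):
    """Pack rows of 0/1 pixels into bytes, one bit per pixel, MSB first.

    Assumption for illustration: each row is padded with zero bits up to
    the next whole byte.
    """
    packed = bytearray()
    for row in pixels:
        for start in range(0, len(row), 8):
            chunk = row[start:start + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | (1 if bit else 0)
            byte <<= 8 - len(chunk)  # pad a short final chunk with zero bits
            packed.append(byte)
    return bytes(packed)


def bits_per_pixel(levels):
    """Return how many bits are needed to distinguish `levels` values."""
    return max(1, (levels - 1).bit_length())


# One row of eight alternating dark/light pixels fits in a single byte.
row = [[1, 0, 1, 0, 1, 0, 1, 0]]
print(pack_monochrome(row).hex())   # aa
print(bits_per_pixel(2))            # monochrome: 1 bit per pixel
print(bits_per_pixel(256))          # 256 gray levels: 8 bits per pixel
```

In this sketch a monochrome pixel occupies exactly one bit, whereas a pixel with 256 gray levels requires eight bits, consistent with the description above.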
When a text document is scanned into a computer, it is turned into a bitmap, which, as described above, can represent an image of the text. Subsequently, the OCR software can analyze the light and dark areas of the bitmap in order to identify each alphabetic letter and numeric digit. When the OCR system recognizes a character, it converts it into ASCII text.
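The recognition step described above can be approximated, for purposes of illustration only, by comparing the light and dark areas of a scanned glyph against stored templates. The sketch below is a simplified nearest-template matcher and is not the method of any particular OCR system. The 3x5 glyph shapes in `TEMPLATES` and the function names are assumptions made for this example, with `#` denoting a dark pixel and `.` a light pixel.

```python
# Hypothetical 3x5 templates; real systems use far richer models.
TEMPLATES = {
    "0": ("###", "#.#", "#.#", "#.#", "###"),
    "1": (".#.", "##.", ".#.", ".#.", "###"),
    "7": ("###", "..#", ".#.", ".#.", ".#."),
    "L": ("#..", "#..", "#..", "#..", "###"),
}


def hamming(a, b):
    """Count the pixels at which two equally sized glyphs differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph):
    """Return the template character closest to the scanned glyph."""
    return min(TEMPLATES, key=lambda ch: hamming(glyph, TEMPLATES[ch]))


def to_ascii(glyphs):
    """Recognize each glyph and return the text and its ASCII codes."""
    text = "".join(recognize(g) for g in glyphs)
    return text, [ord(c) for c in text]


# A scanned "L" with one pixel lost to noise is still matched to "L".
noisy_l = ("#..", "#..", "#..", "#..", "##.")
print(recognize(noisy_l))                          # L
print(to_ascii([TEMPLATES["1"], TEMPLATES["0"]]))  # ('10', [49, 48])
```

Note that the output of such a step is only a sequence of character codes. Nothing in it records where on the page each character appeared or how the characters were grouped into lines, columns, or headings.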
Although extremely limited, conventional OCR systems are oftentimes used to convert standard formats, such as portable document format (PDF), into text. This task is very difficult because all of the structure of the document is lost when the document is rendered for the purpose of OCR. That structure must therefore be inferred or recovered reliably if the document is to be repurposed. A more standard approach is to write a converter that understands the original format and performs the conversion by “parsing,” or interpreting, the commands in the original format. The problem with this approach is that universality is lost: the conversion then depends on the specifics of the format, which are subject to change and differ from one format to the next.
As stated above, although OCR has been employed in the past to parse and convert text into a target format, these systems do not consider the originating and/or target formats of the documents. Additionally, conventional converters parse the native format of the original document (e.g., PDF). Such a converter must therefore have knowledge of the source format and must be continually updated to remain compatible with any changes to that format.