It is common for individuals and businesses to archive physical paper documents in digital form. For example, a multi-page paper document can be scanned and saved electronically in a digital format such as Adobe's PDF (Portable Document Format). Digital document models represent different aspects of an electronic document. For example, a three-aspect model may include the text, a raster image of each page, and the document structure. A simpler two-aspect model may include only the text and a raster image of each page. The common denominator of these models is the text, which is used for indexing, summarization, document clustering, archiving, linking, and task-flow/work-flow assignment.
Consequently, methods and systems for recognizing and extracting text from digital images of physical documents are often rated according to their accuracy and utility. An individual scanning a document wants to be confident that all text has been recognized and extracted and wants that text to be arranged in a meaningful manner.