Government agencies, corporations, publishers and other institutions often require large collections of paper-based documents to be converted into digital forms suitable for digital libraries, electronic archival purposes, further processing or the like. In some cases, the number of documents to be converted is extremely large, exceeding hundreds of thousands of individual pages.
Computers are employed to convert these large collections of paper-based documents into computer-readable formats. Typically, paper-based documents are first scanned to produce a high-resolution digital image of each page. The images are often processed further to enhance quality, remove unwanted artifacts, and analyze their content.
Document digitization is a process of capturing data records from digital images, physical paper, or other media. Traditionally, one can use either human data entry or an automated method assisted by optical character recognition (OCR) technology, intelligent character recognition (ICR) technology, natural handwriting recognition (NHR) technology, or a combination of these. These methods have met the demands of document digitization in cases where the fields to be captured are few or the quality of the content is good enough for an aggressive OCR, ICR, or NHR system.
As recognized by those skilled in the art, OCR involves converting a digital image of textual information into a form that can be processed as textual information. Since electronically captured documents are often simply optically scanned digital images of paper documents, page decomposition and OCR are often used together to gather information about the digital image and sometimes to create an electronic document that is easy to edit and manipulate using commonly available word processing and document publishing software. In addition, the textual information collected from the image through OCR is often used to allow documents to be searched based on their textual content.
The digital images, however, often include errors and thus may not be acceptable for their intended purposes. Even today's fully automated document analysis and extraction systems are not able to generate documents that are essentially errorless, especially when large collections of paper-based documents are being converted into digital form. By way of example, some documents contain a mixture of text and images, such as newspapers and magazines that include advertisements or pictures. Automated document analysis and extraction systems can generate errors while analyzing and extracting different portions of such documents.
U.S. Patent Application Publication No. 2006/0285746 proposes a method, apparatus, and system for computer assisted document analysis. One embodiment is a method for software execution. The method is said to include selecting, in response to user input, criteria in a character recognition engine to identify suspect errors in scanned documents; executing the engine on a subset of the scanned documents to determine an accuracy of error detection using the criteria; and adjusting, in response to user input, the criteria to adjust the accuracy of identifying suspect errors.
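By way of illustration only, the select, evaluate, and adjust cycle attributed to that publication can be sketched in a few lines of Python. The sketch is not taken from the publication. It assumes that the "criteria" reduce to a single confidence threshold, that the recognition engine reports a confidence value for each word, and that a manually verified transcription exists for the sample subset. The names `RecognizedWord`, `flag_suspects`, and `detection_accuracy`, as well as the sample data, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    text: str          # text returned by the (hypothetical) recognition engine
    confidence: float  # engine-reported confidence, assumed to lie in [0, 1]
    truth: str         # manually verified text for the sample subset


def flag_suspects(words, threshold):
    """Flag each word whose confidence falls below the threshold."""
    return [w.confidence < threshold for w in words]


def detection_accuracy(words, threshold):
    """Return (precision, recall) of the suspect flags on the verified subset."""
    flags = flag_suspects(words, threshold)
    actual = [w.text != w.truth for w in words]  # True where the engine erred
    true_positives = sum(f and a for f, a in zip(flags, actual))
    flagged = sum(flags)
    errors = sum(actual)
    # By convention, report 1.0 when there is nothing to flag or nothing to find.
    precision = true_positives / flagged if flagged else 1.0
    recall = true_positives / errors if errors else 1.0
    return precision, recall


# Invented sample subset: two of the five words were misrecognized.
sample = [
    RecognizedWord("invoice", 0.98, "invoice"),
    RecognizedWord("t0tal", 0.55, "total"),
    RecognizedWord("amount", 0.72, "amount"),
    RecognizedWord("clue", 0.81, "due"),
    RecognizedWord("date", 0.95, "date"),
]

# An operator would inspect these figures and adjust the threshold accordingly.
for threshold in (0.60, 0.75, 0.90):
    precision, recall = detection_accuracy(sample, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this sketch, raising the threshold flags more words as suspect, which tends to catch more true errors at the cost of more false alarms. That trade-off is the kind an operator would tune on a sample subset before processing the full collection.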
From the foregoing, it will be apparent that there is still a need for an improved system and process for digitizing documents and recognizing the content of electronic documents.