Publishers, government offices, and other institutions often desire to convert large collections of paper-based documents into digital forms that are suitable for digital libraries and other electronic archival purposes. In some instances, the number of documents to be converted is quite large and exceeds thousands or even hundreds of thousands of individual pages.
Computers are used to convert such large collections of paper-based documents into computer-readable formats. For example, paper-based documents are initially scanned to produce digital high-resolution images for each page. The images are further processed to enhance quality, remove unwanted artifacts, and analyze the digital images.
The digital images, however, often include errors and thus are not acceptable for digital libraries and other electronic archival purposes. Even fully automated document analysis and extraction systems are not able to generate documents that are errorless, especially when large collections of paper-based documents are being converted into digital form. By way of example, some documents contain a mixture of text and images, such as newspapers and magazines that include advertisements or pictures. Automated document analysis and extraction systems can generate errors while analyzing and extracting different portions of the documents.