1. Field of Disclosure
The disclosure generally relates to the field of digitizing books, and in particular to correcting errors in digital volumes produced from the books.
2. Background Information
A digital corpus may be created by scanning books and other text in one or more libraries. The pages of the books are scanned to produce digital images, and the images are then converted into digital text volumes through optical character recognition (OCR). The resulting digital volumes may then be used for a variety of purposes, such as for creating content for use with electronic reading (eReader) devices and for searching in response to queries.
Some types of library books, such as those that have fallen into the public domain, tend to be quite old. The pages of these books are frequently marked up, warped, or otherwise suboptimal for scanning. In addition, because these types of books usually must be preserved, they are scanned using non-destructive (ND) techniques, which produce lower-quality scans than destructive techniques do.
The lower quality of scans for older books negatively impacts the quality of OCR performed on the books. In addition, older books often use fonts that make accurate OCR even more challenging. As a result, digital volumes produced from these types of books tend to have lower-quality OCR than volumes produced from other types of books. These lower-quality volumes are thus less suitable for reading, searching, and other purposes.