1. Field of Disclosure
The disclosure generally relates to displaying digital documents, and in particular to processing digital images of printed documents and reformatting them for display on client devices.
2. Description of the Related Art
As more and more printed documents, such as books, magazines, newsletters, and the like, have been scanned as digital “images” and converted to editable text using Optical Character Recognition (OCR) technology, people have increasingly come to read such documents using computers. For display of a document on a computer screen, textual versions of the document, such as the text produced via OCR, are often preferable to the image version. Compared to the document image, text is small in size and thus can be transmitted over a computer network more efficiently, leading to quicker transmission times and hence less delay before the document is displayed. Text is also editable (e.g., supports copy and paste) and searchable. Further, text can be reformatted to be displayed clearly (e.g., using a locally available font) and flexibly (e.g., using a layout adjusted to the computer screen), thereby providing a better reading experience. These advantages are especially beneficial to users who prefer to read documents on mobile devices, such as mobile phones and music players, which tend to have lower network transfer speeds, less memory, and smaller screens than a typical desktop or laptop computer system, and which would therefore have difficulty displaying a digital image of a publication.
However, text alone, if created naively, such as via conventional OCR techniques, likewise fails to provide optimal viewability. Conventional OCR techniques attempt to produce output that precisely replicates the appearance of the original document, which in many cases does not result in maximum readability of the essential content of the document. Further, aspects of the document, such as tables of contents, page numbers, headers, and the like, may not be converted accurately, and even if converted accurately, may detract from the readability of the document as a whole. This is particularly so for certain types of documents, such as old public-domain books or magazines, which may be discolored by age or spotted by mold, water, etc., and may contain fonts or layouts that are not well supported by current OCR techniques. As a consequence, such old documents cannot be converted by OCR as accurately as a new document that is in pristine condition and has expected fonts and layouts. Manual techniques for addressing the problem of accurately converting these documents, such as human hand-creation of the main text and stripping of inessential features, are cumbersome and time-consuming, and do not allow display of the document to be customized for different devices.