As the use of computers and computer-based networks continues to expand, content providers are preparing and distributing an increasing amount of content in electronic form. This content includes traditional media such as books, magazines, newspapers, newsletters, manuals, guides, references, articles, reports, documents, etc., that exist in print, as well as electronic media in which the aforesaid content exists in digital form or is transformed from print into digital form through the use of a scanning device. The Internet, in particular, has facilitated the wider publication of digital content through the downloading and display of images of content. As data transmission speeds increase, a growing number of images of pages of content are becoming available online. A page image allows a reader to see the page of content as it would appear in print.
Despite the great appeal of providing digital images of content, many content providers face challenges when generating and storing the images of content, particularly when the accuracy of recognizing text in images is important. For example, to enable users to read page images from a book or magazine on a computer screen, or to print them for later reading, the images must be sufficiently clear to present legible text. Currently, the images are translated into computer-readable data using various character recognition techniques, such as optical character recognition (OCR), which includes digital character recognition. Although the accuracy of OCR is generally high, some page images, even after undergoing OCR processing, are simply unreadable due to various artifacts. While manual correction is possible, the cost of manually correcting misidentified characters or inserting missing characters is extremely high, especially when scanning a large volume of pages.
Another challenge faced by digital content providers is the cost of storing images of content. To reduce storage costs, content providers desire to minimize the size of the files used to store the images. Digital images may be represented at a variety of resolutions, typically denoted by the number of pixels in the image in both the horizontal and vertical directions. Typically, though not always, higher-resolution images have a larger file size and require a greater amount of memory for storage. The cost of storing images of content can multiply greatly when one considers the number of images it takes to capture and store large volumes of media, such as books, magazines, etc. While reducing the size and resolution of images often reduces the requirements for storing the images, low-resolution images eventually reach a point where the images, in particular any text contained therein, are difficult for readers to perceive when displayed. Content providers wishing to provide page images with text must ensure that the images can be rendered in sufficiently high resolution so that displayed text will be legible. Yet another challenge faced by content providers is to provide page images that are scalable, i.e., that may be readily scaled up or down so as to be rendered, for example, on various-sized displays at relatively high resolution while ensuring a minimum quality and legibility of the text in the images.
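By way of illustration only, the relationship between resolution and storage can be sketched with simple arithmetic. The following example assumes an uncompressed raster image of an 8.5 by 11 inch page; the page dimensions, the scan resolutions, the bit depths, and the function name raw_image_bytes are hypothetical choices made for this sketch and do not come from any particular system. Compressed formats would yield smaller files, but the uncompressed figure shows how the pixel count scales.

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel=8):
    """Return the uncompressed size, in bytes, of a scanned page image.

    All parameters are illustrative assumptions: page size in inches,
    scan resolution in dots per inch, and color depth in bits per pixel.
    """
    width_px = round(width_in * dpi)    # pixels in the horizontal direction
    height_px = round(height_in * dpi)  # pixels in the vertical direction
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8        # round up to whole bytes


if __name__ == "__main__":
    # Compare several assumed scan resolutions for 8-bit grayscale pages.
    for dpi in (150, 300, 600):
        size = raw_image_bytes(8.5, 11, dpi)
        print(f"{dpi} dpi: {size:,} bytes per page")
```

Under these assumptions, doubling the resolution in both directions quadruples the pixel count, so a 300 dpi grayscale page occupies 8,415,000 bytes uncompressed, four times the 2,103,750 bytes of the same page at 150 dpi. Multiplied across the pages of a single book, and then across a large collection, the difference in storage cost becomes substantial.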
What is needed is a method and system for reliably processing scanned-in page images including text so that the text in the page images, upon rendering, will be legible and in sufficiently high resolution, and further scalable, without requiring an excessive amount of memory space for storage.