1. Field of the Invention
The present invention relates to methods for representing documents within a computer system. More specifically, the present invention relates to a method and an apparatus for generating a synthetic font to facilitate creation of an electronic document from scanned page-images, wherein the resulting electronic document reproduces both the logical content and the physical appearance of the original document.
2. Related Art
As businesses and other organizations become increasingly computerized, they are beginning to store and maintain electronic versions of paper documents on computer systems. The process of storing paper documents on a computer system typically involves a “document imaging” process, which converts the paper documents into electronic documents. This document imaging process typically begins with an imaging step, wherein document page-images are generated using a scanner, a copier, or a camera. These page-images are typically analyzed and enhanced using a computer program before being assembled into a document container, such as an Adobe® Portable Document Format (PDF) file.
A number of formats are presently used for document imaging. These formats include: (1) plain image, (2) searchable image (SI), and (3) formatted text and graphics (FT&G). The “plain-image” format provides a bitmap representation of the image, which is quite useful for archival applications, such as check processing.
The searchable image (SI) format uses scanned images for document display (e.g., in a document viewer), and uses invisible text derived from the scanned images for document search and retrieval. There are two common flavors of searchable image: (1) SI (exact); and (2) SI (compact). SI (exact) maintains a bit-for-bit copy of the scanned pages, whereas SI (compact) applies lossy compression to the original page-images to produce smaller but nearly identical “perceptually lossless” page-images for document display.
Formatted text and graphics (FT&G) uses formatted text, graphical lines, and placed images to construct representations of the original page-images. FT&G can be “uncorrected,” which means it includes suspects (word images plus hidden text) in place of formatted text for low-confidence optical character recognition (OCR) results. Alternatively, FT&G can be “corrected” by manually converting suspects to formatted text. (Note that the term “OCR” refers to the process of programmatically converting scanned blobs into corresponding ASCII characters.)
When determining which document imaging format to use, a user typically considers a number of attributes of interest. For example, the attributes of interest can include the following:
(1) Display fidelity: Does the display version of the electronic document look exactly like the original scan?
(2) Display quality: Is the display version of the electronic document easy to read?
(3) Display performance: Does poor display performance (e.g., page display speed) detract from viewer satisfaction?
(4) Searchability: Can relevant text be found within a document collection and within individual documents?
(5) Production cost: How much does the document imaging process cost (both in equipment cost and manual labor)?
(6) Reflow: Will document reflow be possible to enable viewing on a mobile device?
(7) Accessibility: Is the document accessible by vision-impaired users?
(8) File size: How big is the file (smaller is better)?
With respect to these attributes, the above-described image formats generally perform as follows:
(1) Display fidelity: SI (exact) is best; SI (compact) is OK; FT&G (corrected) is good; FT&G (uncorrected) is fair.
(2) Display quality: FT&G (corrected) is best; FT&G (uncorrected) is good; SI formats are poor.
(3) Display performance: FT&G formats are best; SI formats are fair.
(4) Searchability: FT&G (corrected) is best; others are good.
(5) Production cost: FT&G (uncorrected) and SI formats are best (i.e., cheapest); FT&G (corrected) is worst.
(6) Reflow: FT&G (corrected) is best; FT&G (uncorrected) is fair; SI formats are worst (i.e., not reflowable).
(7) Accessibility: FT&G (corrected) is best; FT&G (uncorrected) is poor; SI formats are worst (i.e., not accessible).
(8) File size: FT&G (corrected) is best (i.e., smallest); FT&G (uncorrected) is good; SI (compact) is fair; SI (exact) is poor.
As can be seen from the list above, each of these document imaging formats has unique advantages compared to the other formats. Hence, when a user has to choose one of the document imaging formats, the user typically has to forego advantages that the user would like to have from the other formats.
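The trade-off described above can be checked mechanically. The following is a minimal illustrative sketch, not part of the described method: it records only which formats are rated “best” for each attribute in the comparison above and confirms that no single format is best for every attribute. The names `BEST`, `FORMATS`, `best_everywhere`, and `attributes_not_best` are hypothetical and chosen for illustration. The plain-image format is omitted because the comparison does not rate it.

```python
# "Best" ratings transcribed from the comparison in this section.
# Assumption: "FT&G formats" means both corrected and uncorrected,
# and "SI formats" means both SI (exact) and SI (compact).
FORMATS = ["SI (exact)", "SI (compact)", "FT&G (uncorrected)", "FT&G (corrected)"]

BEST = {
    "display fidelity": {"SI (exact)"},
    "display quality": {"FT&G (corrected)"},
    "display performance": {"FT&G (uncorrected)", "FT&G (corrected)"},
    "searchability": {"FT&G (corrected)"},
    "production cost": {"FT&G (uncorrected)", "SI (exact)", "SI (compact)"},
    "reflow": {"FT&G (corrected)"},
    "accessibility": {"FT&G (corrected)"},
    "file size": {"FT&G (corrected)"},
}


def best_everywhere():
    """Return the formats rated best for every attribute (hypothetical helper)."""
    return [f for f in FORMATS if all(f in winners for winners in BEST.values())]


def attributes_not_best(fmt):
    """Return the attributes for which `fmt` is not rated best."""
    return [attr for attr, winners in BEST.items() if fmt not in winners]


if __name__ == "__main__":
    print("Best for every attribute:", best_everywhere())
    for fmt in FORMATS:
        print(fmt, "is not best for:", attributes_not_best(fmt))
```

Under these ratings, even the strongest format, FT&G (corrected), falls short on display fidelity and production cost, which is the gap the following paragraph addresses.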
Hence, what is needed is a method and an apparatus for obtaining the advantages of all of the existing document imaging formats within a single document imaging format.