Scanned documents are often large, typically ranging from 2 million to 200 million pixels (or samples). Some applications benefit from displaying more compact images representing documents on displays with significantly fewer pixels, called herein constrained displays. Constrained displays can be displays with a physically limited number of pixels, such as screens on PDAs, mobile devices, cellphones, and office equipment such as digital copier front panels. For example, many PDAs currently have fewer than 100,000 pixels. Constrained displays can also be regions in a larger physical display (e.g., a high resolution monitor, printed page, etc.). Graphical User Interfaces (GUIs) may have regions (e.g., icons, search results, etc.) associated with documents. One type of constrained display is a region for displaying a thumbnail image. A thumbnail image (or thumbnail) is typically 3,000 to 30,000 pixels. Constrained displays may also simply be those in which the available width and height are smaller than those of the documents or images being displayed.
A thumbnail is a small image representation of a larger image, usually intended to make it easier and faster to look at or manage a group of larger images. Most thumbnails are simply downsampled versions of the original picture. In other words, traditional thumbnails rescale an entire document to a required width and height and typically preserve the aspect ratio. Software packages for web thumbnail creation are available that focus on speeding up the process of thumbnail generation. There are also software tools (e.g., pnm tools for UNIX) that perform automatic cropping of margins.
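The traditional rescaling described above can be illustrated with a minimal sketch. The function name `fit_within` and the example page size (a letter-size page scanned at 300 dpi, 2550 by 3300 pixels) are assumptions chosen for illustration, not taken from any particular software package. The sketch only computes the target dimensions; the resampling itself is left to an imaging library.

```python
def fit_within(src_w, src_h, max_w, max_h):
    """Return (width, height) of the largest size that fits inside
    max_w x max_h while keeping the aspect ratio of src_w x src_h.

    Hypothetical helper for illustration; integer arithmetic is used
    so that rounding is deterministic.
    """
    if min(src_w, src_h, max_w, max_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # The image is width-limited when src_w / max_w >= src_h / max_h.
    if src_w * max_h >= src_h * max_w:
        width = max_w
        # Rounded integer division, never below one pixel.
        height = max(1, (src_h * max_w + src_w // 2) // src_w)
    else:
        height = max_h
        width = max(1, (src_w * max_h + src_h // 2) // src_h)
    return width, height


# Assumed example: a 2550 x 3300 scan (about 8.4 million pixels)
# reduced to fit a 100 x 100 thumbnail region.
thumb_w, thumb_h = fit_within(2550, 3300, 100, 100)
reduction = (2550 * 3300) / (thumb_w * thumb_h)
```

With these assumed numbers the thumbnail is 77 by 100 pixels, so each thumbnail pixel stands for roughly a thousand scanned pixels, which is why body text is generally unreadable in a traditional thumbnail.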
“Enhanced thumbnails” have been proposed to provide a better representation of documents available in HTML format. For example, see Woodruff, A., Faulring, A., Rosenholtz, R., Morrison, J., Pirolli, P., “Using thumbnails to search the Web,” Proc. SIGCHI 2001, pp. 198-205, 2001. These enhanced thumbnails are created by lowering the contrast in the traditionally created thumbnail and superimposing keywords found in the HTML.
Other work has been done to create more efficient thumbnails; see, for example, Ogden, W., Davis, M., Rice, S., “Document thumbnail visualizations for rapid relevance judgments: When do they pay off?,” TREC 1998, pp. 528-534. Certain thumbnail representations have special, machine-recognizable information encoded into them to allow retrieval of the original documents by scanning or other machine input of the thumbnail; see, for example, Peairs, M., “Iconic Paper,” Proceedings of 3rd ICDAR 95, vol. 2 pp. 1174-1179, 1995.
Other work has been done to create new uses of traditional thumbnails. For example, thumbars are documents that have been reformatted to a fixed width but unrestricted length, and are used in web applications for HTML documents. Keywords are displayed in different color codes in the thumbar. In general, the text is not readable. See Graham, J., “The Reader's Helper: a personalized document reading environment,” Proc. SIGCHI '99, pp. 481-488, 1999.
Often an icon identifies the type of the file (e.g., the program that created it) instead of being related to the content. In these cases, readability of the text of the original document in the thumbnail is not the goal. Thumbnail representations often have information other than readable text that allows retrieval of original documents by looking at the thumbnail.
The next decades should see a dramatic decline in the use of paper documents in favor of electronic documents. This paper-to-electronic transition may make the design of scanned document tools strategic for companies. An important characteristic of scanned documents is that objects, especially text, are not identified and recognized in the file. Identifying them requires a post-analysis, often by Optical Character Recognition (OCR) (or, more generally, document analysis) software, that tries to locate and identify text characters, words and lines in scanned documents. Currently, OCR is generally used to output the recognized text as a text file, for keyword search or as extra information, or to append the text and its location as metadata to the scanned document, as in Adobe Acrobat Capture of Adobe of Mountain View, Calif.
A document analysis system consists of two parts: layout analysis and character recognition (also called optical character recognition or OCR). The character recognition part uses language-specific information to interpret characters and groups of characters to produce output in symbolic form such as ASCII. The layout analysis part consists of the steps necessary before performing the character recognition, namely grouping individual foreground pixels into characters or character elements such as strokes (connected ink blobs), finding image regions that contain text, and grouping text information into units such as paragraphs, lines, words and characters. These units are characterized by rectangular bounding boxes. Character recognition is a difficult task, and the OCR software may make several mistakes on a document. Small amounts of text in very large fonts, such as some titles and headings, can be particularly difficult to recognize. Such errors can be annoying to the user and can mislead applications.
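The first layout analysis step mentioned above, grouping foreground pixels into connected ink blobs and describing each by a rectangular bounding box, can be sketched as follows. This is a minimal illustration only: the function name `component_boxes`, the list-of-rows bitmap format, and the choice of 8-connectivity are assumptions, and production document analysis software typically uses more elaborate and faster methods.

```python
from collections import deque


def component_boxes(bitmap):
    """Return bounding boxes of connected foreground regions.

    bitmap is a list of equal-length rows in which truthy values mark
    foreground (ink) pixels. Each box is (left, top, right, bottom)
    with inclusive coordinates. Pixels touching horizontally,
    vertically, or diagonally are treated as connected (an assumption).
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not bitmap[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            left = right = x
            top = bottom = y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for ny in range(max(0, cy - 1), min(height, cy + 2)):
                    for nx in range(max(0, cx - 1), min(width, cx + 2)):
                        if bitmap[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


# Tiny made-up bitmap containing two separate ink blobs.
page = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
]
boxes = component_boxes(page)
```

Later layout analysis steps would merge such boxes into words, lines, and paragraphs; that grouping is not shown here.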
Layout analysis information has already been used for expanding white space (see Chilton, J. K., Cullen, J. F., “Expansion of White Space in Document Images for Digital Scanning Devices”), reducing white space (see U.S. Pat. No. 5,832,530 “Method and Apparatus for Identifying Words Described in a Portable Electronic Document”), or adapting to a constrained display, like a PDA display (see Breuel T. M., Janssen, W. C., Popat, K., Baird, H. S., “Paper to PDA,” IEEE 2002, pp. 476-479).
Adobe attaches OCR information to images of scanned documents in order to allow searchability of the text. The OCR information, however, is not used to create thumbnails. If OCR fails for some text, that text is not searchable.
However, no scanned document reformatting method based on layout analysis has been developed to target a two-dimensional constrained display.