The present invention is directed to a method of reducing document size for digital display and, more particularly, to a method of eliminating selected rows and columns from an image of a document page.
Electronic document systems are becoming increasingly popular for storing reference materials. Conventional electronic document systems comprise a scanner which scans an original document, digitizes each page of the document and converts it into an image comprised of picture elements, or pixels; a computer for processing the pixels and for performing any modifications to the image such as, but not limited to, size reduction; and a monitor for viewing the scanned document. The scanned documents are then stored for later retrieval.
Electronic documents can be easily retrieved from a document system's memory and viewed on a monitor. However, the amount of data contained in a standard 8.5×11 inch document page significantly exceeds the amount of data which can be viewed on a typical monitor. In particular, a standard image resolution for document scanners is 300 pixels per inch (conventionally referred to as dots per inch, or dpi). Thus, for an 8.5×11 inch document page, the scanned size is 2550×3300 pixels. The pixel dimensions of the screen of a conventional monitor are 1600×1280. Based on these measurements, it is evident that a full page cannot be completely displayed on such a monitor.
It is thus well known that if the entire document page is to be viewed on the monitor, the amount of data contained in the document page must be reduced. Indeed, one approach known in the prior art is to subsample the image in such a way as to match the pixel dimensions of the subsampled image with those of the monitor on which it is to be displayed. This approach, however, can result in a severe loss of clarity, such as edge definition, and a noticeable reduction in the size of, for example, text or other features. In the above illustration, for example, no less than about 75% of the image data is lost, since the 2,048,000 pixels of the monitor can hold less than one quarter of the 8,415,000 pixels of the scanned page.
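The arithmetic behind the above illustration can be checked with a short sketch. The function name `subsample_loss` and its parameters are hypothetical and introduced only for illustration. The sketch assumes a landscape 1600-by-1280 screen and reports two figures: the loss if every screen pixel could be used, and the loss under a uniform scale that preserves the aspect ratio of the page.

```python
def subsample_loss(page_in=(8.5, 11.0), dpi=300, screen=(1600, 1280)):
    """Return (page_w, page_h, best_case_loss, uniform_loss).

    best_case_loss: fraction of page pixels discarded if the whole
                    screen area were filled (aspect ratio ignored).
    uniform_loss:   fraction discarded when one scale factor is applied
                    to both axes so the full page fits on the screen.
    """
    # Scanned page size in pixels at the given resolution.
    page_w = round(page_in[0] * dpi)
    page_h = round(page_in[1] * dpi)

    page_pixels = page_w * page_h
    screen_pixels = screen[0] * screen[1]
    best_case_loss = 1 - screen_pixels / page_pixels

    # The limiting axis determines the uniform scale factor.
    scale = min(screen[0] / page_w, screen[1] / page_h)
    uniform_loss = 1 - scale ** 2

    return page_w, page_h, best_case_loss, uniform_loss


if __name__ == "__main__":
    w, h, best, uniform = subsample_loss()
    print(f"page: {w} x {h} pixels")
    print(f"loss if the whole screen is filled: {best:.1%}")
    print(f"loss with a uniform scale factor:   {uniform:.1%}")
```

Under these assumptions the uniform-scale figure is the larger of the two, because the page and the screen have different aspect ratios and part of the screen is left unused.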
In order to ameliorate this situation, the page can be processed prior to subsampling in such a way as to remove so-called low-information areas, such as areas of white space or black space. Since this will result in a certain reduction in the amount of data in the image to be subsampled, less information-bearing data is lost because the degree of subsampling required is lessened. The clarity of the page is better, and the reduction in size of the text is minimized. This technique is described more fully in L. O'Gorman, et al. "Subsampling Text Images", 1st Intl. Conf. on Document Analysis and Recognition, St. Malo, France, September, 1991, pp. 219-227.
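By way of illustration only, the removal of low-information areas prior to subsampling can be sketched as follows. This is a deliberately naive sketch, not the technique of the cited reference: it assumes a bi-level image held as a list of rows, treats a row or column as low-information only when every pixel in it equals the background value, and uses the hypothetical name `strip_blank_lines`.

```python
def strip_blank_lines(image, background=0):
    """Drop every row and column made up solely of background pixels.

    image:      list of equal-length rows of pixel values
    background: pixel value regarded as carrying no information
    """
    # Rows containing at least one non-background pixel.
    keep_rows = [r for r, row in enumerate(image)
                 if any(p != background for p in row)]

    # Columns containing at least one non-background pixel.
    width = len(image[0]) if image else 0
    keep_cols = [c for c in range(width)
                 if any(row[c] != background for row in image)]

    # Rebuild the image from the surviving rows and columns only.
    return [[image[r][c] for c in keep_cols] for r in keep_rows]


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 0, 0],
    ]
    print(strip_blank_lines(page))  # [[1, 1], [1, 0]]
```

Because fewer pixels remain after such a step, a milder degree of subsampling suffices to fit the result on the monitor. Note that this sketch removes every blank row and column indiscriminately, including margins and the spacing between lines and blocks.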
A problem remains, however. The techniques disclosed to this point for reducing or removing the low-information areas of a page can, in many cases, distort its formatting. The term "formatting," in this context, refers to the apparent spatial and/or geometrical relationships among the major pictorial elements of the page, e.g., blocks of text, the lines of text within a block, tables and figures, columns of white space between blocks, headers, etc.--that is, the particular visual appearance of the overall page.