The invention relates to downsampling documents, and in particular to downsampling using optical character recognition, font substitution and equalization.
Input/output devices in modern telecommunications and computer systems are devices by which information (e.g. text, data, video, images, etc.) can be transferred to or from the system or be displayed for further processing or interpretation, including interpretation by people using the system. The information of interest is termed a "document." A document may be made manifest or rendered in a variety of forms. For example, a document could be rendered in an analog fashion, e.g. on paper, microfiche, or 35 mm film. Alternatively, the document may be rendered as a digital bit map, e.g. a screen dump, or the document may be rendered in a character representation, e.g. ASCII, Latin 1, Unicode or in a markup language such as LaTeX, SGML or PostScript. It is possible to convert a document in one representation to a document in another representation; however, the conversion may result in a loss of information or in the introduction of noise (e.g. the loss of resolution in documents produced by fax machines).
Importantly, a document often needs to be sampled at different rates in order to be rendered on different output devices. For example, laser printers typically produce tangible paper outputs with a resolution of 300-600 dots per inch (dpi) while the resolution of a fax output is 100-200 dpi, and the resolution of bit map terminals is 75-100 dpi. To output a document where the resolution of the document is higher than the resolution of the output device, it is typically necessary to downsample the higher resolution document so as to output only a portion of the information in the higher resolution document. Standard downsampling techniques include low pass filtering and decimation. However, these techniques do not work well for very low resolution devices.
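As an illustration of the standard technique named above, the following is a minimal sketch of low pass filtering followed by decimation on a bit map stored as a list of rows. The function names `low_pass`, `decimate` and `downsample`, and the choice of a simple averaging (box) filter, are assumptions made for illustration only; the text does not specify a particular filter.

```python
def low_pass(image, k):
    """Box filter: each output pixel is the mean of the k x k window whose
    top-left corner is that pixel (the window is clipped at the image edges)."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [image[i][j]
                      for i in range(r, min(r + k, rows))
                      for j in range(c, min(c + k, cols))]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


def decimate(image, k):
    """Keep every k-th pixel of every k-th row and discard the rest."""
    return [row[::k] for row in image[::k]]


def downsample(image, k):
    """Low pass filter first, then decimate by the same factor k."""
    return decimate(low_pass(image, k), k)


# Hypothetical 4 x 4 bit map reduced to 2 x 2.
bitmap = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
]
reduced = downsample(bitmap, 2)
```

Each output value is a gray level between 0 and 1. At very low resolutions these averages blur thin character strokes, which is one way to see why such techniques do not work well for very low resolution devices.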
Another downsampling technique is font substitution. This method is applied only to documents in a text or character representation (i.e. a representation in which the sequence and location of characters is known). A font is a representation of a character set (e.g. an alphabet). A font has a number of attributes: the family (e.g. Times Roman, Helvetica, etc.); the face (e.g. bold, italics, etc.); the size (e.g. 12 point, 18 point); and the resolution of the output device via which the document will be rendered. In font substitution, a character in the higher resolution document is identified, and the character is output to the lower resolution device in a font designed to be "good looking" at the lower resolution. In short, in font substitution one or more of the font attributes are changed before the characters in the document are output to the lower resolution device. The problem with downsampling by font substitution is the need to know, reliably, the position and identity of the characters so that an appropriate substitute can be selected. Such information is available in documents represented in LaTeX or SGML, or produced by some optical character recognition (OCR) systems. However, this information is typically not readily available in many types of documents, e.g. faxes. Thus, there is a need for improved methods of downsampling in order to output documents on low resolution devices.
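The font attributes listed above, and the idea of changing one or more of them before output, can be sketched as follows. The class names `Font` and `PlacedChar`, the field names, and the function `substitute_font` are hypothetical and chosen for illustration; the sketch changes only the resolution attribute, whereas a practical system might also change the family, face or size.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Font:
    family: str          # e.g. "Times Roman", "Helvetica"
    face: str            # e.g. "bold", "italics"
    size_pt: int         # e.g. 12, 18
    resolution_dpi: int  # resolution of the device the font is designed for


@dataclass(frozen=True)
class PlacedChar:
    char: str   # identity of the character
    x_in: float  # horizontal position on the page, in inches (assumed unit)
    y_in: float  # vertical position on the page, in inches (assumed unit)
    font: Font


def substitute_font(placed, target_dpi):
    """Return the same character at the same position, set in a font
    that differs only in the resolution it is designed for."""
    return replace(placed, font=replace(placed.font, resolution_dpi=target_dpi))


# A character identified in a 300 dpi document, prepared for a 100 dpi device.
original = PlacedChar("A", 1.0, 2.0, Font("Times Roman", "bold", 12, 300))
substituted = substitute_font(original, 100)
```

The sketch makes the dependency explicit: `substitute_font` cannot be called unless the identity and position of the character are already known.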
The aforementioned problems are solved, in accordance with the principles of the invention, by a method of downsampling a component in a document where the component is in a character representation and has an associated reliability measure. The reliability measure indicates the probability that the associated character representation correctly identifies the component. The method downsamples the component by a first method of downsampling if the reliability measure is above a threshold and by a second method of downsampling otherwise.
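The selection rule just described can be sketched as a short function. The name `downsample_component` and its parameters are assumptions for illustration; the two downsampling methods are passed in as callables so that the sketch does not presume any particular implementation of either one.

```python
def downsample_component(component, reliability, threshold,
                         first_method, second_method):
    """Downsample by first_method if the reliability measure is above the
    threshold, and by second_method otherwise."""
    if reliability > threshold:
        return first_method(component)
    return second_method(component)


# Hypothetical stand-ins for the two downsampling methods.
def first(component):
    return ("first", component)


def second(component):
    return ("second", component)


confident = downsample_component("e", 0.95, 0.8, first, second)
doubtful = downsample_component("e", 0.40, 0.8, first, second)
```

A reliability measure exactly equal to the threshold is not "above" it, so in this sketch such a component is handled by the second method.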
In preferred embodiments of the invention the first method of downsampling is so-called font substitution, and the second method is so-called decimation. In a further aspect of the invention, decimation is combined with nonlinear filtering in downsampling the component.
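One way decimation might be combined with nonlinear filtering is sketched below. The text above does not say which nonlinear filter is used or in what order the two steps are applied; this sketch assumes a median filter (a common nonlinear filter) applied before decimation, and the names `median_filter`, `keep_every` and `filter_then_decimate` are hypothetical.

```python
from statistics import median_low


def median_filter(image, radius=1):
    """Replace each pixel with the (low) median of its neighborhood,
    clipped at the image edges. Assumed choice of nonlinear filter."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [image[i][j]
                      for i in range(max(0, r - radius), min(rows, r + radius + 1))
                      for j in range(max(0, c - radius), min(cols, c + radius + 1))]
            out_row.append(median_low(window))
        out.append(out_row)
    return out


def keep_every(image, factor):
    """Decimation: keep every factor-th pixel of every factor-th row."""
    return [row[::factor] for row in image[::factor]]


def filter_then_decimate(image, factor, radius=1):
    """Median filter first, then decimate (assumed ordering)."""
    return keep_every(median_filter(image, radius), factor)


# Hypothetical bit map containing one isolated noise pixel at the top-left.
noisy = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
plain = keep_every(noisy, 2)
cleaned = filter_then_decimate(noisy, 2)
```

In this example, decimation alone happens to sample the noise pixel and carries it into the output, while the median filter removes it first. Unlike an averaging filter, the median keeps the output values binary.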