The present invention relates to processing text that includes coded and noncoded units of text.
An image of text, such as one on a piece of paper, can be converted into a digital representation by digitization. A digital representation of text can be a coded or noncoded representation.
A coded representation of text is character based; that is, it is a representation in which the text has been interpreted as characters. The characters are typically represented by character codes, such as codes defined by the ASCII or Unicode Standard character encoding standards, but may also be represented by character names. The universe of characters in any particular context can include, just by way of example, letters, numerals, phonetic symbols, ideographs, punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, and so on. A character is an abstract entity. How a character is represented visually (e.g., as a glyph on a screen or a piece of paper) is generally defined by a font defining a particular typeface. In digital or computer-based typography, a digital font, which will now be referred to simply as a "font", such as any of the PostScript® fonts available from Adobe Systems Incorporated of San Jose, Calif., generally includes instructions (commonly read and interpreted by rendering programs executing on computer processors) for rendering characters in a particular typeface. A coded representation can also be referred to as a character-based representation.
A noncoded representation of text is an image representation. It is a primitive representation in which the text is not interpreted into characters. Instead, it may be represented as an array of picture elements ("pixels"). A bitmap is one primitive representation, in which each pixel is represented by one binary digit or bit in a raster. A pixel map is a raster representation in which each pixel is represented by more than one bit. An image representation of a page of text, for example, can be divided into lexical units, each of which can have a coded as well as a noncoded representation, as will be described.
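By way of illustration only, the distinction between a bitmap and a pixel map can be sketched in Python as follows. The nested-list layout, the 8-bit grayscale depth, the convention that larger values mean more ink, and the helper name `threshold` are assumptions made for this sketch and are not part of the described techniques.

```python
# Bitmap: one bit per pixel (assumed convention: 1 = ink, 0 = background).
bitmap = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]

# Pixel map: more than one bit per pixel (8-bit grayscale assumed here,
# with larger values meaning more ink).
pixel_map = [
    [10, 200, 220, 5],
    [180, 30, 20, 240],
    [255, 250, 245, 230],
    [190, 12, 8, 201],
]


def threshold(pm, cutoff=128):
    """Reduce a pixel map to a bitmap by comparing each pixel to a cutoff."""
    return [[1 if value >= cutoff else 0 for value in row] for row in pm]
```

Thresholding is only one possible way to obtain a bitmap from a pixel map; it is used here because it makes the one-bit-per-pixel property easy to see.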
Digitization of an image generally results in a primitive representation, typically a bitmap or pixel map. If the image contains text, the primitive representation can be interpreted and converted to a higher-level coded format such as ASCII through use of an optical character recognition (OCR) program. A confidence-based recognition system, such as the one described in commonly-owned U.S. Pat. No. 5,729,637 (the '637 patent), processes an image of text, recognizes bitmap images as characters, and converts the recognized bitmaps into codes that represent the corresponding characters. Some words may be recognized only with a low level of confidence or not recognized at all. When the image is displayed, low-confidence words are displayed in their original bitmap form, while words recognized with sufficiently high confidence are displayed from a rendering of their codes.
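The per-word choice between a coded and a noncoded representation can be sketched as follows. This is a minimal illustration and does not reproduce the system of the '637 patent; the `Word` record, the `choose_representation` function, and the 0.9 threshold are hypothetical names and values introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

Bitmap = List[List[int]]  # assumed convention: 1 = ink, 0 = background


@dataclass
class Word:
    original_bitmap: Bitmap   # noncoded form taken from the scanned image
    text: Optional[str]       # recognized characters, or None if unrecognized
    confidence: float         # recognizer confidence in [0, 1] (assumed scale)


def choose_representation(word: Word, min_confidence: float = 0.9) -> str:
    """Return "coded" for confidently recognized words, else "noncoded"."""
    if word.text is not None and word.confidence >= min_confidence:
        return "coded"      # display a rendering of the character codes
    return "noncoded"       # display the original bitmap


words = [
    Word([[1, 0], [1, 1]], "cat", 0.97),
    Word([[0, 1], [1, 1]], "c?t", 0.40),
    Word([[1, 1], [0, 1]], None, 0.0),
]
plan = [choose_representation(w) for w in words]
```

Deciding per whole word, rather than per character, matches the behavior described above, in which whole words are either rendered or left in original bitmap form.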
A digital representation of an image including both coded and noncoded units can be displayed on a raster output device such as a computer display or printer. This type of display, i.e., one containing both original and rendered bitmaps, will be referred to as a hybrid display. The coded units are rendered (i.e., rasterized), which may be accomplished in a variety of ways, such as by retrieving an output bitmap stored in memory for a code or by computing an output bitmap according to a vector description associated with a code. The result will be referred to as a rendered bitmap. The noncoded units are displayed in their original bitmap form, which will be referred to as original bitmaps. Typically, whole words are either rendered or left as original bitmaps for display on raster output devices.
The original and the rendered bitmaps of a hybrid display tend to have different optical densities. This difference in optical densities causes the appearance of the original and rendered bitmaps to differ. The resulting hybrid display may therefore lack a uniform appearance and may not be aesthetically pleasing.
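To make the density mismatch concrete, the following sketch compares two bitmaps of the same stroke. No particular measure of optical density is defined at this point in the text; the fraction of ink pixels used here, the function name `optical_density`, and the sample bitmaps are assumptions chosen for illustration.

```python
def optical_density(bitmap):
    """Assumed measure: ink pixels divided by total pixels (1 = ink)."""
    total = sum(len(row) for row in bitmap)
    if total == 0:
        return 0.0
    ink = sum(sum(row) for row in bitmap)
    return ink / total


# A scanned vertical stroke, three pixels wide (for example, thickened by ink spread).
original = [[0, 1, 1, 1, 0] for _ in range(4)]

# The same stroke rendered from a font, one pixel wide.
rendered = [[0, 0, 1, 0, 0] for _ in range(4)]

density_original = optical_density(original)   # 12 ink pixels out of 20
density_rendered = optical_density(rendered)   # 4 ink pixels out of 20
```

Under this assumed measure the scanned stroke is three times as dense as the rendered one, which is the kind of difference that makes a hybrid display look non-uniform.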
In general, in one aspect, the invention features techniques that can be implemented as methods, systems, or apparatus, including computer program products, for processing text that includes coded (character based) and noncoded (image based) representations of text. The techniques include deriving a correction factor from a coded representation of a second unit of text and an original noncoded representation of the second unit of text, and modifying a representation of a first unit of text in accordance with the correction factor, where a common font typeface is attributed to both the first and second units. Advantageous implementations include one or more of the following features. The correction factor is calculated by rendering a coded representation of the second unit of text in the font typeface to generate a rendered representation, calculating a reference ratio from an optical density of the rendered representation and an optical density of an original noncoded representation of the second unit of text, and inverting the reference ratio to calculate the correction factor. Where the first unit of text includes a word, the optical density of the word is adjusted by modifying a bitmap representation of the word, where pixels are added to, or removed from, the bitmap, or the bitmap is left unchanged, according to the value of the correction factor.
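The correction-factor steps described above can be sketched as follows. Several points are assumptions of this sketch rather than statements of the claimed techniques: optical density is taken to be the fraction of ink pixels, the reference ratio is taken to be the original density divided by the rendered density, and pixels are added or removed by a single pass of 4-neighbor dilation or erosion. The function names and the `tolerance` parameter are hypothetical.

```python
from typing import List

Bitmap = List[List[int]]  # assumed convention: 1 = ink, 0 = background


def optical_density(bm: Bitmap) -> float:
    """Assumed measure: ink pixels divided by total pixels."""
    total = sum(len(row) for row in bm)
    return sum(sum(row) for row in bm) / total if total else 0.0


def correction_factor(original_second: Bitmap, rendered_second: Bitmap) -> float:
    """Reference ratio (assumed: original / rendered), then inverted."""
    d_orig = optical_density(original_second)
    d_rend = optical_density(rendered_second)
    if d_orig == 0 or d_rend == 0:
        raise ValueError("both bitmaps must contain at least one ink pixel")
    reference_ratio = d_orig / d_rend
    return 1.0 / reference_ratio


def ink_neighbors(bm: Bitmap, r: int, c: int) -> int:
    """Count in-bounds 4-connected neighbors that are ink."""
    count = 0
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < len(bm) and 0 <= cc < len(bm[rr]):
            count += bm[rr][cc]
    return count


def adjust(bm: Bitmap, factor: float, tolerance: float = 0.05) -> Bitmap:
    """Add pixels (factor > 1), remove pixels (factor < 1), or copy unchanged."""
    out = [row[:] for row in bm]
    if abs(factor - 1.0) <= tolerance:
        return out                      # densities already match closely
    grow = factor > 1.0
    for r, row in enumerate(bm):
        for c, value in enumerate(row):
            n = ink_neighbors(bm, r, c)
            if grow and value == 0 and n > 0:
                out[r][c] = 1           # dilation: thicken strokes
            elif not grow and value == 1 and n < 4:
                out[r][c] = 0           # erosion: thin strokes
    return out
```

As a usage example, if the second unit's original bitmap has density 0.5 and its rendering has density 0.75, the factor is 1.5 and a first-unit bitmap would be thickened. A single morphological pass is a coarse adjustment; an implementation could instead scale the number of pixels changed to the magnitude of the factor.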
In general, in another aspect, the invention features an electronic document representing text in a page description language, where the text has a first unit and a second unit, and where a common font typeface is attributed to both the first and the second unit of text. The electronic document has a coded representation of the second unit of text in characters of the common font typeface. The electronic document also has a final raster representation of the first unit of text that is a modified representation generated from an original noncoded representation of the first unit of text according to a correction factor that was computed from a noncoded representation of the second unit of text and an optical density of a rendered coded representation of the second unit of text. Advantageous implementations include one or more of the following features. The original noncoded representation of the first unit of text is derived from a scanned image of a portion of a paper document containing the first unit of text.
Advantages that can be seen in implementations of the invention include one or more of the following. The invention allows a hybrid display to appear uniform and aesthetically pleasing. The invention facilitates visual reading and recognizing of bitmap units displayed on a raster-output device.