1. Field of the Invention
The present invention relates to formatting text obtained by optical character recognition (OCR) from a scanned image. Such text is referred to herein as OCR text. More specifically, the invention relates to selection of a typeface and/or font for the display of OCR text. One application of the invention is to facilitate the selection of text from displayed OCR text.
2. Description of Related Art
When generating OCR text from an image, or when copying a section of OCR text, it is often desirable to preserve the typeface and font size of the original document, so as to produce an OCR version of the document having the same text format.
There are many known techniques for analyzing OCR text to establish a typeface and font size, and also for copying formatted text into a computer application program (such as a word processing program). Examples of known techniques are described in the following U.S. Pat. Nos. 5,860,074; 5,185,818; 5,889,897; 5,623,681; 5,729,637; 5,625,711; 5,689,585; 5,754,873; 5,778,403; 5,438,657; 4,887,301; 5,953,453; 5,883,974; 5,668,891; 5,513,277; and 5,253,307.
The known techniques for determining the typeface and font size of OCR text generally rely on an analysis of individual characters (e.g., character width or thickness) to obtain a best fitting typeface and font size for each character. However, such techniques are particularly prone to errors caused by noise in the image, which can make the individual characters difficult to characterize in terms of a typeface and a font size.
It would be advantageous to provide an alternative technique which can be easier to implement and which is not limited to character-by-character analysis.
Broadly speaking, one aspect of the present invention is to determine a typeface setting and/or a font size setting for OCR text based on matching a word size instead of an individual character size as in the prior art.
Using a word size instead of an individual character size can provide the following advantages:
a) Processing time is reduced (as there are fewer words than characters).
b) In order to match character widths, it is necessary to define exactly which part of an image corresponds to a particular character, i.e. to segment a word into individual characters. For poor quality binary images, there may be multiple candidate segmentations of an image of a word into its component characters. In contrast, it is much easier to segment an image into separate words with less ambiguity.
c) Image heights and widths are based on the number of pixels. The fractional accuracy of measuring the width of a character in an image is on average much worse than the fractional accuracy of measuring the width of a word in an image.
d) Typefaces are characterized not only by the shapes and sizes of characters but also by character spacing. Matching on the basis of character widths does not take this into account, and so loses this important information. In contrast, matching on the basis of word lengths inherently takes into account the spacing of adjacent characters in the word.
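By way of illustration only, advantage (c) can be made concrete with a short worked example. The figures are assumptions chosen for the arithmetic and do not come from any measurement: a measurement uncertainty of one pixel, a character 8 pixels wide, and a five-character word 40 pixels wide.

```latex
\[
\frac{\delta}{w_{\text{char}}} = \frac{1\ \text{px}}{8\ \text{px}} = 12.5\%,
\qquad
\frac{\delta}{w_{\text{word}}} = \frac{1\ \text{px}}{40\ \text{px}} = 2.5\%
\]
```

Under these assumed figures, the same one-pixel uncertainty produces a fractional error five times smaller when the width of the whole word is measured.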
In one form, the invention provides a technique comprising:
(a) calculating for each word and for each of a plurality of possible typefaces (also referred to herein as candidate typefaces) a scaling factor to match a typeface rendering of the word to the size (e.g., width) of the word in the scanned image; and
(b) analyzing the variation of the calculated scaling factors for a typeface to identify whether the typeface is a good fit to a plurality of the words.
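By way of illustration only, step (a) may be sketched in Python as follows. The typeface names, the per-character advance widths and the function names are all hypothetical; a practical system would presumably obtain widths from real font metrics, and this sketch ignores kerning.

```python
# Hypothetical advance widths at font size 1 for two made-up candidate
# typefaces. These numbers are assumptions for illustration only.
CANDIDATE_TYPEFACES = {
    "NarrowSans": {"i": 0.22, "l": 0.22, "m": 0.78, "w": 0.70, "default": 0.50},
    "WideSerif": {"i": 0.30, "l": 0.30, "m": 0.90, "w": 0.85, "default": 0.60},
}


def rendered_width(word, metrics):
    """Width of the word rendered at font size 1 (sum of advance widths)."""
    return sum(metrics.get(ch, metrics["default"]) for ch in word)


def scaling_factors(words, typefaces=CANDIDATE_TYPEFACES):
    """For each candidate typeface, compute one scaling factor per word.

    `words` is a list of (text, width_in_pixels) pairs, where the width is
    the measured width of the word in the scanned image. The scaling factor
    is the value that makes the rendered width equal the measured width.
    """
    return {
        name: [width_px / rendered_width(text, metrics) for text, width_px in words]
        for name, metrics in typefaces.items()
    }
```

For example, `scaling_factors([("mill", 14.4), ("will", 13.6)])` returns, for each candidate typeface, a list of two scaling factors that can then be examined as in step (b).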
The term "typeface rendering" means a rendering of the word when the typeface is applied thereto.
The above technique relies on the principle that it is possible to match any typeface to any word width (and height) by the use of suitable scaling factors (equivalent to a font size). However, by analyzing the scaling factors required to fit the typeface to a plurality of words, it is possible to establish whether the typeface is in reality a good match to the original typeface in the originally scanned image. If the scaling factor remains relatively constant, then the typeface is a good match. However, if the scaling factor varies considerably from word to word, then this indicates that the typeface is not a good match.
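By way of illustration only, the principle above may be sketched in Python. The text does not name a particular statistic, so the coefficient of variation used here (standard deviation divided by mean) is an assumption, as are the typeface names and the sample scaling factors.

```python
from statistics import mean, pstdev


def relative_spread(factors):
    """Coefficient of variation of the scaling factors (an assumed measure)."""
    return pstdev(factors) / mean(factors)


def best_typeface(factors_by_typeface):
    """Return the candidate whose scaling factors vary least from word to word."""
    return min(
        factors_by_typeface,
        key=lambda name: relative_spread(factors_by_typeface[name]),
    )


# Made-up scaling factors for four words under two hypothetical candidates.
factors = {
    "TypefaceA": [11.9, 12.1, 12.0, 12.2],  # nearly constant: plausible match
    "TypefaceB": [9.5, 13.0, 11.2, 14.8],   # varies widely: poor match
}
print(best_typeface(factors))
```

In this sketch the candidate with the smallest relative spread is treated as the best match; a threshold on the spread could equally be used to reject all candidates.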
Preferably, the calculated scaling factors are analyzed by identifying one or more clusters of the scaling factors indicative of the typeface being a good fit to a plurality of the words.
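By way of illustration only, one simple way to identify such clusters is sketched below in Python. The one-dimensional grouping rule, the 5% relative tolerance and the minimum cluster size are assumptions made for the example; the text does not prescribe a clustering method.

```python
def cluster_scaling_factors(factors, rel_tol=0.05, min_size=3):
    """Group sorted scaling factors into clusters of nearby values.

    A value joins the current cluster if it lies within `rel_tol` of the
    smallest value in that cluster; otherwise it starts a new cluster.
    Clusters with fewer than `min_size` members are discarded as noise.
    """
    clusters = []
    for value in sorted(factors):
        if clusters and value <= clusters[-1][0] * (1 + rel_tol):
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return [c for c in clusters if len(c) >= min_size]


# Made-up scaling factors for seven words under one candidate typeface.
sample = [12.0, 12.1, 11.9, 18.0, 18.2, 17.9, 25.0]
print(cluster_scaling_factors(sample))
```

With the sample values, two clusters survive. More than one cluster for the same typeface might, for instance, correspond to text set at more than one font size, while the isolated value is discarded.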
Preferably, a plausible size range is used to represent the size of the word in the scanned image. Preferably, the size range is defined by a plausible minimum size and a plausible maximum size. This can take into account that, due to poor image quality or image noise, it may be difficult to determine the precise size (in pixels) of the word in the image. Nevertheless, it will be appreciated that applying such a range to a word is far more accurate than having to identify the size of an individual character (letter), as conventional techniques require.
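By way of illustration only, a plausible size range can be carried through to the scaling factor as an interval rather than a single value. The Python sketch below assumes that each word yields a (minimum, maximum) interval of scaling factors, and uses a simple sweep to find a scaling factor consistent with the largest number of words. The function names and the sweep itself are assumptions and are not taken from the text.

```python
def scale_interval(min_width_px, max_width_px, unit_width):
    """Range of scaling factors consistent with a plausible width range.

    `unit_width` is the width of the word rendered at font size 1 in the
    candidate typeface.
    """
    return (min_width_px / unit_width, max_width_px / unit_width)


def best_supported_scale(intervals):
    """Return (scale, count): a scale lying inside the most intervals."""
    events = []
    for lo, hi in intervals:
        events.append((lo, 0))  # 0 sorts before 1, so openings precede closings at ties
        events.append((hi, 1))
    events.sort()
    best_scale, best_count, count = None, 0, 0
    for value, kind in events:
        if kind == 0:
            count += 1
            if count > best_count:
                best_scale, best_count = value, count
        else:
            count -= 1
    return best_scale, best_count


# Made-up intervals for four words; the last is inconsistent with the rest.
intervals = [(11.5, 12.5), (11.8, 12.4), (12.0, 13.0), (15.0, 16.0)]
print(best_supported_scale(intervals))
```

A candidate typeface for which most word intervals share a common scaling factor would, under this sketch, be regarded as a good fit.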
Preferably, the invention is used in combination with a word-to-word selection tool for enabling a user to select a section of text in a pseudo-image of a document in which a text portion of the document image is replaced by an OCR processed and typeface rendered image of the words of the text.