The invention relates to the field of resolving character ambiguities.
In modern offices, documents are routinely scanned into computers where they are stored as electronic images. While these images allow users to view the contents of the scanned documents, they do not allow users to manipulate the contents using standard word processing utilities. For example, users cannot find and replace a word in a scanned document stored as an electronic image. Optical character recognition algorithms attempt to recognize text occurring in scanned document images, and store the text in a formatted file that is comprehensible to a word processing algorithm. Examples of formatted files include Unicode files and ASCII files. However, since optical character recognition algorithms are not 100% accurate, they occasionally misidentify words and characters in a scanned document, or encounter ambiguous words and characters which they cannot resolve.
One ambiguity encountered by optical character recognition algorithms is whether a hyphen occurring in a hyphenated word is a hard hyphen or a soft hyphen. Hard hyphens are hyphens which belong in a word, such as the hyphens in the word daughter-in-law. Soft hyphens are hyphens which are inserted into a word by a word-processing application for typesetting purposes only. Soft hyphens typically occur at the end of a line of text and are inserted to divide a word into two word fragments, the first of which remains on the current line of text followed by the soft hyphen, while the second begins a new line of text. Soft hyphens and hard hyphens are represented differently in formatted files. For example, soft hyphens are represented in Unicode formatted files by the hexadecimal word 00AD, while hard hyphens are represented by the hexadecimal word 002D. When an optical character recognition algorithm encounters a hyphen in a hyphenated word, it needs to identify the hyphen as either a hard hyphen or a soft hyphen so that it can store the hyphen with the appropriate code in the formatted output file. That way, a subsequent word processing application reading the formatted file can correctly interpret the hyphen and display the word in which it appears.
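The difference between the two hyphen codes can be illustrated with a minimal Python sketch. It is not the method of the invention; the function names encode_hyphenated and plain_text are hypothetical, and the decision of whether a given hyphen is hard or soft is assumed to have been made already and is passed in as a flag.

```python
# Unicode code points named in the text above.
SOFT_HYPHEN = "\u00ad"  # SOFT HYPHEN, a typesetting-only break
HARD_HYPHEN = "\u002d"  # HYPHEN-MINUS, part of the word itself


def encode_hyphenated(first_fragment: str, second_fragment: str, is_hard: bool) -> str:
    """Join two word fragments with the hyphen code matching the decision.

    `is_hard` is an assumed input: how that decision is reached is
    outside the scope of this sketch.
    """
    hyphen = HARD_HYPHEN if is_hard else SOFT_HYPHEN
    return first_fragment + hyphen + second_fragment


def plain_text(word: str) -> str:
    """Drop soft hyphens, as a reader might when the word is not broken across lines."""
    return word.replace(SOFT_HYPHEN, "")


if __name__ == "__main__":
    soft = encode_hyphenated("docu", "ment", is_hard=False)
    hard = encode_hyphenated("daughter-in", "law", is_hard=True)
    print(plain_text(soft))
    print(plain_text(hard))
```

Because the soft hyphen is stored with its own code, a downstream application can discard it when the word is reflowed, while a hard hyphen always remains part of the word.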
Another ambiguity encountered by optical character recognition algorithms is whether white space between two characters in a string of characters is mere kerning between the characters in a word or a word separator. All characters in a typeset document are separated by white space. Within a word, the white space between characters is called kerning and is put there to give the characters and the word a visually pleasing appearance. Typically, the kerning between characters in a word is less than 1/20 of an em in length. (An em is a unit of typeset distance, and is roughly equal to the width of the letter ‘M’ in a given font and point size.) Between words, the white space in a document is called a blank space, or simply a space. Typically, blank spaces are ½ an em in length, or larger. When an optical character recognition algorithm encounters a string of characters where two or more characters in the string are separated by one or more white spaces too big to be unambiguously kerning (greater than 1/20 of an em), yet too small to be unambiguously blank spaces (smaller than ½ an em), the algorithm must correctly resolve each white space into kerning or blank space to correctly group the character string into one or more words, as appropriate.
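The thresholds described above can be sketched as a three-way classification in Python. This is an illustration under stated assumptions, not the method of the invention: the name classify_gap is hypothetical, the gap and em widths are assumed to be measured in the same unit (for example, pixels), and a gap of exactly 1/20 of an em, which the text does not assign to either class, is treated here as ambiguous.

```python
from fractions import Fraction

# Typical thresholds taken from the text, expressed as fractions of an em.
KERNING_MAX = Fraction(1, 20)  # kerning is typically less than this
SPACE_MIN = Fraction(1, 2)     # blank spaces are typically this or larger


def classify_gap(gap_width, em_width) -> str:
    """Classify a white space by its width relative to the em.

    Returns "kerning", "space", or "ambiguous". Both widths must use the
    same unit. A gap of exactly 1/20 em is treated as ambiguous, which is
    an assumption of this sketch.
    """
    ratio = Fraction(gap_width) / Fraction(em_width)
    if ratio < KERNING_MAX:
        return "kerning"
    if ratio >= SPACE_MIN:
        return "space"
    return "ambiguous"


if __name__ == "__main__":
    em = 40  # assumed em width in pixels
    for gap in (1, 10, 20):
        print(gap, classify_gap(gap, em))
```

Only gaps that fall into the ambiguous band require further resolution; the other two classes can be decided from the measurement alone.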