The present invention relates to the field of optical character recognition (OCR) of cursive, normal handwriting by individuals. More particularly, it relates to the OCR of text that is written or printed in any of a plurality of languages in which the letters of the alphabet, even though small in number, may assume different shapes depending on their position within a word, and may connect to an adjacent character at their left, right, both, or not at all. It further relates to translation from one language, as represented by cursive script, to another. The method of the invention does not attempt to segment words into characters before recognition; rather, it follows the writing strokes or traces from beginning to end, and only then attempts recognition of characters in a word (as in some English script) or in a sub-word or word (as in Arabic and cursive representations of many languages). An important feature of the invention is that it recognizes that sub-words may exist in a plurality of languages, and that an existing text may contain several languages; for example, it recognizes the common phenomenon that a quotation may be in a language different from the main language of the text.
Examples of prior art directed to character segmentation are the following U.S. patents:
U.S. Pat. No. 4,024,500 granted May 17, 1977, and titled “Segmentation Mechanism for Cursive Script Character Recognition Systems”.
U.S. Pat. No. 4,654,873 granted Mar. 31, 1987, and titled “System and Method for Segmentation and Recognition of Patterns”.
U.S. Pat. No. 5,001,765 granted Mar. 19, 1991, and titled “Fast Spatial Segmenter for Handwritten Characters”.
U.S. Pat. No. 5,101,439 granted Mar. 31, 1992, and titled “Segmentation Process for Machine Reading of Handwritten Information”.
U.S. Pat. No. 5,111,514 granted May 5, 1992, and titled “Apparatus for Converting Handwritten Characters onto Finely Shaped Characters of Common Size and Pitch, Aligned in an Inferred Direction”.
U.S. Pat. No. 5,151,950 granted Sep. 29, 1992, and titled “Method for Recognizing Handwritten Characters Using Shape and Context Analysis”.
In U.S. Pat. No. 4,773,098 granted Sep. 20, 1988, and titled “Method of Optical Character Recognition”, individual characters are recognized by means of assigning directional vector values in contour determination of a character.
In U.S. Pat. No. 4,959,870 granted Sep. 25, 1990, and titled “Character Recognition Apparatus Having Means for Compressing Feature Data”, feature vectors having components which are histogram values are extracted and compressed, then matched with stored compressed feature vectors of standard characters.
U.S. Pat. No. 4,979,226 granted Dec. 18, 1990, and titled “Code Sequence Matching Method and Apparatus”, teaches code sequence extraction from an input pattern and comparison with a reference code sequence for character recognition.
U.S. Pat. No. 3,609,685 granted Sep. 28, 1971, and titled “Character Recognition by Linear Traverse”, teaches character recognition in which the shape of the character is thinned to be represented by a single set of lines and converted to a combination of numbered direction vectors, and the set of direction vectors is reduced to eliminate redundant consecutive identical elements.
U.S. Pat. No. 5,050,219 granted Sep. 17, 1991, and titled “Method of Handwriting Recognition” is abstracted as follows:
“A method of recognition of handwriting consisting in applying predetermined criterions (sic) of a tracing of handwriting or to elements of this tracing so that several characterizing features of this tracing or of these elements be determined, comparing characterizing features thus determined to characterizing features representative of known elements of writing and identifying one element of the tracing with one known element of writing when the comparison of their characterizing features gives a predetermined result, wherein the improvement consists in the setting up of a sequence of predetermined operating steps in accordance with predetermined characterizing features by applying criterions to the tracing elements.”
The above United States patents are incorporated herein by reference, where permitted. None of the known prior art, however, teaches how to deal with units of interconnected text tracings wherein vectors remain unused after all characters have been recognized, nor how to deal with the appearance of multiple languages within a single document or on a single page.
It has been found that a more efficient character recognition is achieved using encoded units of interconnected text tracings as a sequence of directions in a plane where the units are recognized as sub-words, where all vectors in the text tracings are used to create the character or language fragment being recognized, and where the vector sequences are tested against one or a plurality of sets of language-specific rules.
It has further been found that the amount of pre-processing, before recognition but after acquisition of the text image and noise reduction and filtering, is reduced if the input text is not segmented into constituent characters before it is presented to the recognition engine. Thus, the natural segmentation inherent in the text image (due to spacing between words and sub-words) is adhered to and exploited.
In the present disclosure and claims, “sub-words” mean the intra-connected portions of words that are bounded by a break in the cursive text, i.e. where successive characters are not bound by a ligature. Sub-words can be as long as the entire word or as short as one character, or even a portion of a character if, for example, the character includes a secondary feature.
The present invention provides an improvement to the known methods of optical character recognition in which the characters can comprise a plurality of languages, comprising an intermediate step wherein an acquired text image consisting of a sequence of planar directional vectors is analyzed by the recognition engine in chunks of intra-connected sub-words, the cursive text is parsed and a character marker is entered upon the recognition of each successive sub-word, and if unused vectors remain following the recognition of connected sub-units of text, then the text is reparsed by moving the character marker forward or backward one vector at a time until each vector in the sequence contributes to recognition of the characters of the text, as described in copending Canadian patent application S. N. 2139094. The recognition engine further uses a first set of language-specific rules, and if after exhausting the entries in the first set of language-specific rules a particular sub-word is not recognized, it compares that sub-word with a second set of language-specific rules until the sub-word is recognized.
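The reparsing step described above can be illustrated with a minimal sketch in Python. It is an assumption-laden illustration and not the implementation of the invention: the function names `parse_subword` and `recognize`, the toy rule tables and the integer direction codes are all hypothetical, and a simple backtracking search stands in for moving the character marker one vector at a time. A parse is accepted only if every vector of the sub-word contributes to a recognized character; if the first set of rules is exhausted without such a parse, the second set is tried.

```python
def parse_subword(vectors, rules):
    """Return the list of characters spelled by `vectors`, or None if any
    vector would remain unused. `rules` maps a tuple of direction codes
    to one character (hypothetical toy format)."""
    if not vectors:
        return []  # nothing left over: every vector has been used
    # Advance the character marker one vector at a time.
    for marker in range(1, len(vectors) + 1):
        char = rules.get(tuple(vectors[:marker]))
        if char is None:
            continue
        rest = parse_subword(vectors[marker:], rules)
        if rest is not None:
            return [char] + rest
        # Vectors were left unused: reparse with the marker moved forward.
    return None


def recognize(vectors, rule_sets):
    """Try each (language, rules) pair in order; return (language, text)
    for the first rule set that uses every vector, else None."""
    for language, rules in rule_sets:
        chars = parse_subword(vectors, rules)
        if chars is not None:
            return language, "".join(chars)
    return None


# Hypothetical rule tables; the codes and characters are invented.
FIRST_RULES = {(0,): "a", (0, 1): "b", (2,): "c"}
SECOND_RULES = {(3, 4): "x"}
RULE_SETS = [("first", FIRST_RULES), ("second", SECOND_RULES)]
```

With these toy tables, the sequence `[0, 1, 2]` cannot be read as "a" followed by anything, because the vectors `1, 2` would remain unused; moving the marker forward by one vector yields "b" followed by "c". The sequence `[3, 4]` fails against the first rule set and is recognized only by the second.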
The present invention further provides an apparatus for recognition of cursive text in one or more of a plurality of languages from a scanned image, including means for recognizing a sequence of directional vectors as characters only if all of the vectors have contributed to the recognition, means for reparsing the sequence of directional vectors until all of the vectors do contribute to recognition, at least two language-specific dictionaries, and means for comparing the sequence of direction vectors with the language-specific dictionaries. Code to control a computer for carrying out the steps of the method can be programmed onto a suitable medium, for example a magnetic storage diskette or a programmable read-only memory.
The present invention further provides a computer-usable medium containing program code executable by the computer to perform a method for recognition of cursive text in one or more of a plurality of languages from a scanned image, including reparsing a sequence of directional vectors by moving a character marker one vector at a time until each vector in the sequence contributes to recognition of the characters of the text from at least one set of language-specific rules. Examples of media suitable for the storage of such code are magnetically-encoded disks, optically-encoded disks, some forms of which are commonly called CD-ROMs, fixed disk drives and programmable read-only memories, including EPROMs, EEPROMs and flash memory cards. Such code can be readily transmitted in suitable forms, for example in binary-encoded forms on local or wide area networks or on public electronic transmission networks, for example the Internet.
The present invention further provides a computer program product comprising a computer-usable medium containing program code means for recognition of cursive text in one or more of a plurality of languages from a scanned image, the code comprising code means for causing the computer to encode text tracings as vectors, means to recognize the sequence of vectors as characters only if all vectors contribute to the recognition, means to reparse the sequence by moving a character marker, means to provide one or more sets of language-specific rules, means to compare each element of the sequence of vectors with the rules, and means to compare each element of the vector sequence to a second set of language-specific rules if the first set does not produce a match. The computer program product can be any convenient product suitable for storing and transmitting stored code, for example magnetic or optically-encoded disks or programmable read-only memories, including EPROMs, EEPROMs or flash memory cards.
Having recognized the language of the first character, the system of the invention continues to use the dictionary for that first language until it fails to obtain a match in that language. It then attempts recognition in another language until it finds a recognizable character. Thus recognition of the language and also the written text before segmentation is non-deterministic and dictated by the text itself.
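The language-switching behaviour described above can be sketched as follows. This is a hedged illustration only: the function `recognize_text`, the language labels and the dictionary contents are hypothetical, and each sub-word is matched by a single table look-up for brevity. The language that last produced a match is tried first; only when it fails are the remaining dictionaries consulted.

```python
def recognize_text(subwords, dictionaries):
    """Recognize a sequence of sub-words, staying with the current
    language until it fails to produce a match.

    `dictionaries` maps a language label to a table of
    {tuple of direction codes: recognized text} (hypothetical format).
    Returns a list of (language, text) pairs; (None, None) marks a
    sub-word that no dictionary recognizes."""
    current = None
    results = []
    for vectors in subwords:
        key = tuple(vectors)
        # Current language first, then the others in their given order.
        order = [] if current is None else [current]
        order += [lang for lang in dictionaries if lang != current]
        for language in order:
            text = dictionaries[language].get(key)
            if text is not None:
                current = language  # keep using this dictionary
                results.append((language, text))
                break
        else:
            results.append((None, None))
    return results


# Invented example data: the sequence (2,) exists in both dictionaries.
DICTIONARIES = {
    "lang_a": {(0, 1): "p", (2,): "q"},
    "lang_b": {(2,): "r", (3,): "s"},
}
SUBWORDS = [(3,), (2,), (0, 1), (2,)]
```

In this example the sequence `(2,)` is read as "r" while the second language is in force and as "q" after the text has switched to the first language, which is the sense in which recognition is dictated by the text itself.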
Preferably, the sequence of planar directional vectors is obtained by processing a noise-reduced and filtered digitized text image, according to methods known in the art, as follows:
(a) thinning or skeletonizing the text image to its essential skeleton (among other methods, for example, as taught by T. Wakayam in a paper titled “A case line tracing algorithm based on maximal square moving”, IEEE Transactions on Pattern Recognition and Machine Intelligence, VOL PAMI-L1, No. 1, pp 68-74);
(b) converting the thinned image to directional vectors representing the directional flow of the tracings by the sequential data stream of the digitized image (for example, directional vectors are assigned to each element of the skeleton by means of the “Freeman code”); and
(c) applying at least one reduction rule to the string of directional vectors to reduce it in length and yield one form of abstract representation of a word or sub-word. One simple reduction rule in a preferred embodiment specifies that a directional vector immediately following an identical directional vector be discarded. This rule may be applied recursively to a vector string, reducing it considerably.
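Steps (b) and (c) above can be sketched in Python as follows. The sketch assumes that the skeleton of step (a) is already available as an ordered path of 8-connected pixel coordinates, and it assumes the common numbering of the 8-direction Freeman code (0 for east, increasing counter-clockwise, with y increasing upward); the function names are hypothetical. Collapsing each run of identical codes in a single pass gives the same result as applying the reduction rule repeatedly until no vector follows an identical vector.

```python
from itertools import groupby

# 8-direction Freeman code: list index = code, value = (dx, dy).
# Assumed convention: 0 = east, counter-clockwise, y increasing upward.
FREEMAN_STEPS = [(1, 0), (1, 1), (0, 1), (-1, 1),
                 (-1, 0), (-1, -1), (0, -1), (1, -1)]
STEP_TO_CODE = {step: code for code, step in enumerate(FREEMAN_STEPS)}


def freeman_encode(points):
    """Step (b): encode an ordered path of 8-connected skeleton pixels
    as a list of Freeman direction codes."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in STEP_TO_CODE:
            raise ValueError(f"{(x0, y0)} and {(x1, y1)} are not 8-connected")
        codes.append(STEP_TO_CODE[step])
    return codes


def reduce_vectors(codes):
    """Step (c): discard every vector that immediately follows an
    identical vector, keeping one code per run."""
    return [code for code, _ in groupby(codes)]


# Invented example: two steps east, one north-east, two north.
PATH = [(0, 0), (1, 0), (2, 0), (3, 1), (3, 2), (3, 3)]
```

For the example path, `freeman_encode(PATH)` gives `[0, 0, 1, 2, 2]`, which `reduce_vectors` shortens to `[0, 1, 2]`.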
Once the above intermediate pre-processing steps have been applied, language-specific identification of the sequence of directional vectors commences. For example, a set of language-specific grammar rules for a language in a first dictionary would include a look-up table defining each alphabet character by its most abstract (i.e. reduced) sequence of directional vectors. Further language-specific rules may restrict connectivity on either side, or may specify secondary features of a character such as a dot or dots (as in Arabic) or an accent (as in French). It is clear, therefore, that some experimentation will be necessary before arriving at an optimal set of grammar rules for a given language. The grammar rules may include provision for idiosyncrasies of individual writers; for example, some people write part of the alphabet, and print some characters, “r” and “s” being commonly printed in English manuscript. A second example is that some writers will cross a “t” with a horizontal stroke that does not intersect the vertical stroke, thus creating an additional sub-word.
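One possible shape for such a look-up table is sketched below in Python. Everything in it is an assumption made for illustration: the `CharacterRule` fields, the `lookup` function and the vector sequences assigned to the characters are invented and do not come from any actual rule set. Each entry carries the reduced vector sequence of the main stroke, an optional secondary feature, and flags restricting connectivity on either side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterRule:
    char: str
    vectors: tuple             # reduced direction sequence of the main stroke
    secondary: str = ""        # e.g. "dot" or "accent"; empty if none
    joins_left: bool = True    # may connect to a neighbour on its left
    joins_right: bool = True   # may connect to a neighbour on its right


def lookup(rules, vectors, secondary="", connected_left=False,
           connected_right=False):
    """Return the characters whose rule matches the vector sequence, the
    observed secondary feature and the observed connectivity."""
    matches = []
    for rule in rules:
        if rule.vectors != tuple(vectors) or rule.secondary != secondary:
            continue
        if connected_left and not rule.joins_left:
            continue  # the rule forbids a connection on the left
        if connected_right and not rule.joins_right:
            continue  # the rule forbids a connection on the right
        matches.append(rule.char)
    return matches


# Invented entries: the vector sequences are placeholders only.
RULES = [
    CharacterRule("e", (0, 2, 4)),
    CharacterRule("é", (0, 2, 4), secondary="accent"),
    CharacterRule("A", (1, 7), joins_right=False),
]
```

Here the same main stroke is read as "e" or "é" depending on whether an accent was found as a secondary feature, and the entry for "A" is rejected whenever the tracing continues to the right.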
In another embodiment, the invention provides a second set of language-specific grammar rules, which is accessed in turn in a way to be described below.