This invention relates generally to character recognition systems and more particularly to computerized character or text recognition systems.
In the prior art there is disclosed a wide variety of different types of character, image and/or pattern recognition systems, designed to read and convert pages of printed (e.g., typewritten) material into machine readable code, i.e., code capable of being "understood" or "read" by an electronic computer. These systems have become increasingly prevalent and important as input devices for use with modern computer-based systems, inasmuch as they enable pages of text to be quickly and efficiently entered into a computer's data base without requiring the time-consuming and expensive process of having a keypunch operator retype the entire document on computer readable punch cards or, alternatively, type the information directly into the computer through the use of an input terminal.
Although prior art character recognition systems are generally satisfactory for their intended purpose, a major limitation inherent in such systems stems from the fact that they generally utilize character masking (e.g., shape analysis) techniques to convert typewritten or printed characters into a machine readable code. The prior art systems are thus not well adapted for reading multi-styled character fonts, misprinted or misaligned characters, linked or joined characters (e.g., ligatures) and documents typed in proportional spacing, e.g., as is typically done by word processing equipment. Moreover, these prior art devices, which are designed to read only a single or a very few predetermined kinds of type fonts, are incapable of reading other type fonts with which the system is unfamiliar. Thus, these prior art systems do not have general applicability, since there are very many different kinds of type fonts in common usage.
It should also be mentioned that although the prior art character recognition systems generally do not utilize contextual information, e.g., cryptographic techniques, to identify particular characters (numbers or letters) being read, several prior art "spelling-checkers" and/or character recognition devices use contextual information to a limited degree to verify whether a word apparently misspelled or misprinted should be changed in view of certain information obtained from the context in which the word appears.
In that regard, in U.S. Pat. No. 4,058,795 (Balm), there is disclosed a context-aided recognition system. Although the Balm system is designed to identify an unknown character or characters by utilizing certain types of contextual information obtained by analyzing characters on the page, the successful operation of the Balm system requires that the vast majority of the characters on the page and, in particular, those adjacent the unknown character or characters, be known, i.e., have been correctly determined through the use of conventional shape analysis techniques. In particular, the Balm system uses conventional character recognition techniques in the first instance to determine the identity of the vast majority of the characters on the page and it is only after the recognition of most of the characters that contextual techniques are used to complete the identification process. Thus, the contextual analysis as disclosed and utilized in the Balm system is the very specific and limited type of "N-gram" analysis. As is known, N-gram analysis is a type of analysis which is only suitable for use once a large number of characters have already been correctly identified. Thus, in the Balm system the analysis is a deductive process in the sense that it uses contextual information obtained from adjacent characters to deduce the identity of a single unknown or misrepresented character. Moreover, such action is done without producing a generalized definition enabling all glyphs (symbols of language) of a particular shape, wherever they might appear on the page, to be recognized as being that of a given character (letter or number).
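By way of illustration only, the deductive character of N-gram analysis may be sketched as follows. The sketch is not taken from the Balm patent; the function name, the trigram table and its counts are hypothetical and are assumed solely for the purpose of the example. It presumes that the two characters adjacent the unknown character have already been correctly identified by other means.

```python
from collections import Counter

# Hypothetical trigram counts, assumed for illustration only; a practical
# system would derive such a table from a large body of text.
TRIGRAM_COUNTS = Counter({"the": 500, "tie": 20, "toe": 15, "tae": 1})

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def guess_unknown(left: str, right: str) -> str:
    """Deduce one unknown character from its two already-known neighbours.

    The candidate that forms the most frequent trigram wins. Trigrams that
    are absent from the table count as zero.
    """
    return max(ALPHABET, key=lambda c: TRIGRAM_COUNTS[left + c + right])


# The neighbours "t" and "e" are known; only the middle character is in doubt.
print(guess_unknown("t", "e"))
```

It will be noted that the result of such a deduction applies only to the one character position examined; nothing is learned about the shape of the glyph occupying that position, so the same work must be repeated for every other doubtful character on the page.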
Contrasted thereto, the system as disclosed and claimed herein involves an inductive process in the sense that contextual analysis techniques are used not primarily for purposes of decrypting any particular unknown word or letter, but rather are used to build a "construct" capable of identifying all similarly shaped glyphs, wherever they might appear on the page, as being that of a particular character. Thus, the cryptographic techniques as used and disclosed herein produce generalized solutions used for identifying unknown glyphs.
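The general idea of such an inductive process may be illustrated, in greatly simplified form, by the following sketch. It is not the method disclosed or claimed herein; it merely assumes, for the purpose of the example, that similarly shaped glyphs have already been grouped into numbered classes by some earlier step (not shown), and it uses a hypothetical word list and hypothetical function names. No glyph is identified by its shape; the table of glyph classes is built solely from the repetition patterns of the words in which the glyphs appear.

```python
def pattern(seq):
    """Repetition pattern of a sequence: 'noon' and (4, 9, 9, 4) both give (0, 1, 1, 0)."""
    first_seen = {}
    return tuple(first_seen.setdefault(item, len(first_seen)) for item in seq)


def induce_mapping(glyph_words, lexicon):
    """Build a glyph-class -> character table from contextual evidence alone.

    A glyph word is resolved when exactly one lexicon word has the same
    length and repetition pattern and agrees with the table built so far.
    Passes repeat until no new table entry is produced.
    """
    mapping = {}
    changed = True
    while changed:
        changed = False
        for gw in glyph_words:
            candidates = [
                w for w in lexicon
                if len(w) == len(gw)
                and pattern(w) == pattern(gw)
                and all(mapping.get(g, c) == c for g, c in zip(gw, w))
            ]
            if len(candidates) == 1:
                for g, c in zip(gw, candidates[0]):
                    if g not in mapping:
                        mapping[g] = c
                        changed = True
    return mapping


def decode(glyph_words, mapping):
    """Apply the table to every occurrence of every glyph class on the page."""
    return ["".join(mapping.get(g, "?") for g in gw) for gw in glyph_words]


# Hypothetical page: each word is a tuple of glyph-class numbers.
page = [(4, 9, 9, 4), (9, 4), (7, 9)]
lexicon = ["noon", "on", "no", "to"]

table = induce_mapping(page, lexicon)
print(table)
print(decode(page, table))
```

In this example the single word having the pattern of "noon" suffices to define glyph classes 4 and 9, and that definition is then applied wherever those classes occur. Glyph class 7 remains undefined because the available evidence is ambiguous, and is left marked as unknown rather than guessed.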
Other prior art systems are disclosed in U.S. Pat. Nos. 4,010,445 (Hoshino) and 3,995,254 (Rosenbaum). These systems relate to what can generally be referred to as spelling checkers or word verification systems, used for correcting misspelled or incorrectly recognized letters. Although these systems and others disclosed in the prior art use a dictionary "look-up" process to compare input words with a dictionary list to determine whether certain words are, in actuality, misspelled versions of words appearing in the dictionary, these systems do not utilize generalized contextual information and, further, do not use such information in the first instance for primary analysis. Thus, these prior art systems only use some contextual information to a limited degree where the vast majority of the text (i.e., the entire text but for an occasional incorrect character) has been decoded using conventional character recognition techniques, e.g., shape analysis.
To summarize, the prior art systems, although generally suitable for their intended purpose, e.g., spelling checkers, etc., are not at all suitable for being used as a text recognition system capable of recognizing characters or glyphs appearing in a wide variety of different kinds of type fonts in a wide variety of document types, e.g., books, business letters, financial reports, etc.