The present invention relates to electrical communication, and more particularly to a method and apparatus for improving the information content of data by contextual analysis. The invention finds utility in diverse fields, such as digital data transmission, text correction, language translation and the recognition of speech and graphical patterns. The following description, however, will focus on one area of major applicability, that of character recognition.
Most conventional character-recognition systems recognize one input character pattern or image at a time, without considering any information relating to any surrounding character patterns. This compartmentalization of the recognition process presently appears to be the major limitation upon the ability of machines to read degraded and poorly segmented characters. The reading of cursive script by either machines or people appears to be impossible on a character-by-character basis.
Contextual analysis has been successfully used in connection with conventional character-recognition machines. In the "dictionary look-up" technique, a sequence of individually recognized characters is applied to a table containing entries against which the input sequence is matched. The table entry having the closest match to the input sequence is then chosen as the correct output sequence. This technique, however, requires extremely large dictionary tables and a large amount of time to search the tables. The dictionary is difficult to design and becomes considerably less complete as the size of the input sequence grows. Therefore, its use has been limited to special applications involving relatively small numbers of relatively short words.
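By way of illustration only, the dictionary look-up technique may be sketched as follows. The sketch assumes that "closest match" means the fewest differing character positions among table entries of the same length as the input; the text above does not specify the matching measure, and the names `hamming`, `dictionary_lookup` and `DICTIONARY` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def dictionary_lookup(recognized: str, dictionary: list[str]) -> str:
    """Return the same-length table entry closest to the recognized sequence.

    Assumption: closeness is measured by Hamming distance. If the table
    holds no entry of the right length, the input is passed through as-is.
    """
    candidates = [w for w in dictionary if len(w) == len(recognized)]
    if not candidates:
        return recognized
    # Every candidate is compared against the input, which is why search
    # time grows with the size of the table.
    return min(candidates, key=lambda w: hamming(recognized, w))


# Toy table; a practical table would be far larger.
DICTIONARY = ["house", "horse", "mouse", "table"]
corrected = dictionary_lookup("hcuse", DICTIONARY)  # one misread character
```

Even in this toy form, each look-up scans every candidate entry, which reflects the storage and search-time burden noted above.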
A related technique employs sequences called "N-grams". In this method, a fixed number of characters surrounding the unknown character is applied to a table of possible combinations of input characters; the table then outputs that character which has the highest probability of representing the unknown character. The N-grams may be either fixed or sliding. A fixed pentagram, for example, may consider the two characters on either side of the unknown (fifth) character in order to identify the unknown character. The sliding-trigram method, on the other hand, considers first the two characters to the left of the unknown (third) character in order to make a first tentative identification of the unknown character; a second trigram then looks at the character on either side of the unknown character and makes a second estimate of its identity, and a third trigram considers the two characters to the right of the unknown character to make a third provisional identification. The final identification of the unknown character is then made from the three provisional identifications according to a predetermined set of rules. Although N-gram analysis is known to be a powerful tool, the prohibitive size of the required tables renders it impractical for all but a small number of specialized uses. A single fixed pentagram for 27 characters (26 letters and "space"), for example, requires a table having more than fourteen million entries.
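The sliding-trigram method and the table-size figure may be illustrated by the following minimal sketch. It rests on several assumptions not found in the text above: the tables are built by counting trigrams in a small sample string, the "predetermined set of rules" is taken to be a simple majority vote with ties resolved in favor of the earliest estimate, and all names (`build_tables`, `sliding_trigram`, and so on) are hypothetical.

```python
from collections import Counter

# Table size for one fixed pentagram over 26 letters plus "space":
# one entry per combination of five characters.
ALPHABET_SIZE = 27
FIXED_PENTAGRAM_ENTRIES = ALPHABET_SIZE ** 5  # 14,348,907


def build_tables(corpus: str) -> dict:
    """Count which character fills each trigram slot for a given two-character context."""
    tables = {"left": {}, "middle": {}, "right": {}}
    for i in range(len(corpus) - 2):
        a, b, c = corpus[i], corpus[i + 1], corpus[i + 2]
        # Context is the two characters to the left; the unknown is third.
        tables["left"].setdefault((a, b), Counter())[c] += 1
        # Context is one character on either side; the unknown is in the middle.
        tables["middle"].setdefault((a, c), Counter())[b] += 1
        # Context is the two characters to the right; the unknown is first.
        tables["right"].setdefault((b, c), Counter())[a] += 1
    return tables


def most_likely(counts):
    """Return the most frequent character for a context, or None if the context is unseen."""
    return counts.most_common(1)[0][0] if counts else None


def sliding_trigram(text: str, i: int, tables: dict) -> str:
    """Identify text[i] from three provisional estimates (requires 2 <= i <= len(text) - 3)."""
    estimates = [
        most_likely(tables["left"].get((text[i - 2], text[i - 1]))),
        most_likely(tables["middle"].get((text[i - 1], text[i + 1]))),
        most_likely(tables["right"].get((text[i + 1], text[i + 2]))),
    ]
    votes = Counter(e for e in estimates if e is not None)
    if not votes:
        return text[i]  # no table had an entry; keep the original character
    # Assumed rule: majority vote over the three provisional identifications.
    return votes.most_common(1)[0][0]


tables = build_tables("the cat sat on the mat the hat")
identified = sliding_trigram("the cxt sat", 5, tables)  # position 5 holds the doubtful "x"
```

Each sliding-trigram table is indexed by only two context characters, whereas the fixed pentagram is indexed by whole five-character combinations, which is the source of the figure of more than fourteen million entries.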