Optical character recognition (OCR) systems and digital image processing systems are known for use in automatic forms processing. These systems deal with three kinds of data: physical data, textual data, and logical data. Physical data may be pixels on a page or positional information related to those pixels. In general, physical data is not in a form to be effectively used by a computer for information processing. Physical data by itself has neither useable content nor meaning. Textual data is data in textual form. It may have a physical location associated with it. It occurs in, for example, ASCII strings. It has content but no meaning. We know what textual data says, but not what it means. Logical data has both content and meaning. It often has a name for what it is.
For example, there may be a region of black pixels in a certain location on an image. Both the value of the pixels and their location are physical data. It may be determined that those pixels, when properly passed through a recognizer, say: “(425) 867-0700.” Content has been derived from the physical data to generate textual data. If we now know that text of this format (or possibly at this location on a preprinted form) is a telephone number, the textual data becomes logical data.
To facilitate reconciliation of imperfections in physical data and shortcomings of the recognition process, each recognized element of textual data, e.g., a character, may be represented by a ranked group of unique candidates called a “possibility set.” A possibility set includes one or more candidate information pairs, each including a “possibility” and an associated confidence. In the context of an OCR system, the confidence is typically assigned as part of the recognition process. For computational efficiency, the confidences may be assigned within an appropriate base-2 range, e.g., 0 to 255, or a more compact range, such as 0 to 7. For example, FIG. 1 shows an enlarged view of an individual glyph 20 that may be physically embodied as a handwritten character or as a digital pixel image of the handwritten character. From glyph 20, an optical character recognition process may generate the possibility set shown in TABLE 1 by assigning possibilities and associated confidences:
TABLE 1possibilityconfidencec200e123o100
FIG. 2 shows a series of sibling glyphs 22, which are known as “siblings” because they share the same parent word 24. The sibling glyphs 22 can be represented by the four possibility sets as shown in the following TABLE 2:
TABLE 2possconfpossconfpossconfpossconfc200h190o100r125o150n100a80n100e100r80The possibilities of these four possibility sets can be readily combined to form 36 unique strings: “chor”, “ohor”, “ehor”, “cnor”, “cror”, etc. The number of unique strings is determined by the product of the number of character possibilities in each possibility set, i.e., 3×3×2×2=36.
To gage or verify their accuracy, the unique “candidate” strings may be processed by a “dictionary” of valid outcomes. In the context of OCR, a dictionary is a filter. It has content and rules. Each candidate string processed by the dictionary is subject to one of three possible outcomes: it is passed, it is rejected, or it is modified into a similar string that passes. One example of a dictionary is based on the English language. For parent word 24 of FIG. 2, the candidate strings “chor” and “ehar” would be rejected by such a dictionary, while “char” would be passed.
Because dictionaries often have a very large amount of content against which a candidate string is compared, it may be unduly time-consuming to apply the dictionary to all possible strings. To improve efficiency, it is desirable, before applying a dictionary, to rank the candidate strings in order of some confidence based on the accuracy of recognition. In this way the candidate strings having the highest confidence of having been accurately recognized are processed by the dictionary first. Rules can then be used to determine when to stop dictionary processing, e.g., when enough candidate strings have been processed to have isolated the best candidate strings (with a certain probability). A convenient way to rank candidate strings is to calculate string confidences based on the confidences of the component character possibilities that make up each candidate string. A set of candidate strings and their associated string confidences is referred to as an “alt-set.”
One way to rank parent candidates for creating an alt-set is to add the child confidences for each parent candidate. In the above example, “chor” would have a ranking of 615 (the sum of the confidences associated with the individual characters c-h-o-r), “ohor” would have a ranking of 565, “ehor” would have a ranking of 515, etc. Combining the possibility sets to form the 36 unique strings and to calculate their rankings is simple in this example. However, there is no obvious way to read the strings out in ranked order. The strings must first be assigned a ranking, then ordered or sorted based on their assigned rank. This ordering or sorting step becomes especially problematic for longer strings formed from sibling possibility sets having a greater number of possibilities. By way of illustration, a hypothetical 10-character parent word in which each child possibility set includes 10 possibilities would result in 10 billion unique strings. It would be a very time-consuming and computationally expensive task to rank and order 10 billion 10-character strings.
Another known way of improving the efficiency of dictionaries is to use specialized dictionaries that contain smaller amounts of content than a more generalized dictionary but that are limited in their application. One such specialized dictionary is an “n-gram” dictionary, which includes information about the frequency in which certain character sequences (e.g., two-letter, three-letter, etc.) occur in the English language. For example, the two-letter combination “Qu” (a 2-gram) occurs in English words much more frequently than “Qo.” To benefit from an n-gram dictionary, the confidence assigned to an n-gram is some combination of (1) the aggregate character confidences and (2) the n-gram frequency provided by the n-gram dictionary. Thus, recognition may have produced Oueen and Queen where the first character has the possibility set: poss=O, conf=200; poss=Q, conf=100, but in the English language “Qu” happens much more often than “Ou”, so the 2-gram dictionary would help determine that Queen is the more likely parent string.
A need exists for a method of generating candidate strings in ranked order on an as-needed basis and, more generally, for a method of generating ranked parent candidates on an on-demand basis from a series of sibling possibilities. A need also exists for such a method that can be used with data at different logical levels in a logical data hierarchy, such as n-grams, words, and phrases.