A regular expression is a sequence of pattern-defining characters that has long been employed for identifying strings of text. For example, the regular expression [0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9] may be employed to identify social security numbers in the standard format (such as 333-22-4444). In order to determine whether a given regular expression matches a string of text, string searching algorithms may be implemented.
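By way of illustration only, the example expression above may be exercised with a general-purpose regular expression library. The following is a minimal sketch using Python's standard re module; the names SSN_PATTERN and find_ssns are hypothetical and are introduced here solely for the example.

```python
import re

# Nine digits grouped 3-2-4 and separated by hyphens, as in the example above.
SSN_PATTERN = re.compile(r"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]")


def find_ssns(text):
    """Return every non-overlapping substring of `text` that matches the pattern."""
    return SSN_PATTERN.findall(text)
```

For instance, calling find_ssns on the text "SSN 333-22-4444 on file" would be expected to return the single string 333-22-4444.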
One string searching algorithm commonly implemented is the Aho-Corasick algorithm. The Aho-Corasick algorithm, which is typically employed to determine whether the string contains a match to the regular expression, is generally a two-step process. The first step converts the regular expression into a Deterministic Finite State Automaton (DFSA). The second step utilizes the DFSA constructed in the first step to linearly scan the string in order to identify an exact match to the regular expression. The Aho-Corasick algorithm is well known and widely available for review online, and its details will not be repeated here.
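The two-step structure described above (construct an automaton, then scan linearly) may be sketched as follows. This sketch is not the Aho-Corasick algorithm itself; it substitutes a plain subset construction over a fixed sequence of character classes, which is assumed here to be adequate for the social security number example. All names (PATTERN, build_dfa, scan) are hypothetical.

```python
import string

DIGITS = frozenset(string.digits)
DASH = frozenset("-")

# The example pattern as a sequence of character classes: DDD-DD-DDDD.
PATTERN = [DIGITS] * 3 + [DASH] + [DIGITS] * 2 + [DASH] + [DIGITS] * 4
ALPHABET = DIGITS | DASH


def build_dfa(pattern, alphabet):
    """Step 1: build a deterministic automaton by subset construction.

    Each automaton state is the set of pattern positions that are live.
    Position 0 is kept in every state so that a match may begin anywhere.
    """
    accept = len(pattern)
    start = frozenset({0})
    transitions = {}
    seen = {start}
    pending = [start]
    while pending:
        state = pending.pop()
        for ch in alphabet:
            live = {0}
            for pos in state:
                if pos < accept and ch in pattern[pos]:
                    live.add(pos + 1)
            nxt = frozenset(live)
            transitions[(state, ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                pending.append(nxt)
    return start, transitions, accept


def scan(text, dfa):
    """Step 2: one left-to-right pass over the text.

    Returns the end offset (exclusive) of every match found.
    """
    start, transitions, accept = dfa
    state = start
    ends = []
    for i, ch in enumerate(text):
        # A character outside the alphabet matches no class, so only
        # position 0 stays live, which is exactly the start state.
        state = transitions.get((state, ch), start)
        if accept in state:
            ends.append(i + 1)
    return ends
```

In this sketch the automaton is built once and may be reused across any number of strings; the scan visits each character exactly once, which is the property that makes the second step linear in the length of the string.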
Generally speaking, it is well known that a string searching algorithm such as the Aho-Corasick algorithm may be employed to identify all occurrences of a regular expression when matching Latin-script, typewritten text. However, it is significantly more difficult to identify strings of text that appear in image form. An example of an image that includes a string of text in image form may be an employee's W-4 form that has been filled out by hand-printing and then scanned in for document retention.
One approach implemented for identifying strings of text in an image that contains an image version of that string of text involves first performing Optical Character Recognition (OCR) on the image and then performing a string searching algorithm to match the regular expression(s) in the OCRed text. However, this approach is generally not deemed to be sufficiently accurate, efficient or workable for text other than neat typewritten text because the OCR approach tends to generate many recognition errors per page. The recognition errors degrade the performance of any subsequent matching process.
Consider the situation wherein, for example, an OCR engine is applied to an image of the hand-printed string 012-34-5678. In this example, performing OCR may result in the recognized string “OI2-34-S678” where the digit “0” is recognized as the letter “O”, the digit “1” as the uppercase “I”, and the digit “5” as the letter “S”. Once the OCR has been performed, a string searching algorithm may then be employed on the OCR result. Because the string searching algorithm performs a linear scan on the OCR result, an erroneously recognized string of text will result in a match failure or an erroneous match.
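The match failure described above may be reproduced with a short sketch. The two strings are those given in the example; the names SSN_PATTERN and contains_ssn are hypothetical, and Python's standard re module is assumed to stand in for the string searching algorithm.

```python
import re

SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

written = "012-34-5678"     # the string as hand-printed
recognized = "OI2-34-S678"  # the OCR result described above


def contains_ssn(text):
    """Report whether `text` contains a substring matching the pattern."""
    return SSN_PATTERN.search(text) is not None
```

Under these assumptions, contains_ssn reports a match for the hand-printed string but not for the recognized string, because the letters substituted by the OCR engine fall outside the character class [0-9].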
In view of the foregoing, improved techniques for recognizing textual strings that appear in image forms are desired.