Large surveys, such as census population surveys, often use forms on which respondents enter information by hand in response to various survey questions. The survey forms are then electronically scanned, and optical character recognition (OCR) software is used to transform the handwritten text responses into electronic data (referred to herein as OCR text strings). The OCR text strings may, in general, include any combination of one or more characters or symbols, and the term character is used herein to refer to both characters and symbols output by OCR software.
In order to facilitate compilation and analysis of the data, a coding process may be undertaken wherein the OCR text strings are mapped to various categories and assigned codes (e.g., numeric, alpha, or alpha-numeric codes) associated with the categories. The categories may be represented by text strings included in a coding dictionary. The text strings may, in general, include any combination of one or more characters and symbols. In order to map an OCR text string to a category, the OCR text string is compared with the text strings in the coding dictionary to identify which text string in the coding dictionary, if any, the OCR text string most closely resembles. Such comparison may be based on a technique known as the Levenshtein Distance Algorithm (LDA). The LDA involves the computation of a numeric value referred to as the Levenshtein distance (also sometimes referred to as the edit distance), which represents the minimum number of single-character changes (substitutions, insertions, and deletions) that must be made to a particular text string in order to make it identical to another text string with which it is compared. For example, if the OCR software interprets an entry on a census survey form as “GLEEN!WORD V.LLAG” and such OCR text string is compared with the text string “GREENWOOD VILLAGE” from the coding dictionary, three character substitutions (an “L” for the “R” and an “R” for an “O” in the first word, and a “.” for the “I” in the second word), one character deletion (an “E” missing at the end of the second word), and one character insertion (the “!” between the “N” and the “W” in the first word) are present, and hence such comparison is assigned a Levenshtein distance of five.
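By way of illustration only, the comparison described above may be sketched in Python as follows. The sketch uses the standard dynamic-programming formulation of the Levenshtein distance and is not a description of any particular coding system. The function names levenshtein and closest_entry, the max_distance threshold, and the dictionary entries other than “GREENWOOD VILLAGE” are hypothetical and are introduced here solely for the example.

```python
from typing import Iterable, Optional, Tuple


def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character substitutions,
    insertions, and deletions needed to turn `source` into `target`."""
    # previous[j] is the distance between the characters of `source`
    # processed so far (before the current one) and the first j
    # characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


def closest_entry(
    ocr_text: str,
    dictionary: Iterable[str],
    max_distance: int,
) -> Optional[Tuple[str, int]]:
    """Return (entry, distance) for the dictionary entry nearest to
    `ocr_text`, or None if no entry is within `max_distance`.
    The threshold is an assumption of this sketch."""
    best: Optional[Tuple[str, int]] = None
    for entry in dictionary:
        distance = levenshtein(ocr_text, entry)
        if distance <= max_distance and (best is None or distance < best[1]):
            best = (entry, distance)
    return best


# Hypothetical coding dictionary used only for this example.
coding_dictionary = [
    "GREENWOOD VILLAGE",
    "GREENWICH VILLAGE",
    "GLENWOOD SPRINGS",
]

print(levenshtein("GLEEN!WORD V.LLAG", "GREENWOOD VILLAGE"))  # 5
print(closest_entry("GLEEN!WORD V.LLAG", coding_dictionary, max_distance=6))
# ('GREENWOOD VILLAGE', 5)
```

With a stricter threshold, such as max_distance=4, the same OCR text string would match no entry and closest_entry would return None, which corresponds to the “if any” case noted above.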
Unfortunately, accurately mapping the OCR text strings to the appropriate text strings included in the coding dictionary using the LDA is complicated by the fact that the OCR text strings often include errors arising from the inherent difficulty of performing optical character recognition on handwritten text. Such errors (e.g., incorrectly recognized characters, inserted or noise characters, and/or deleted characters) can result in inaccurate Levenshtein distances when the OCR text strings are compared with text strings in the coding dictionary using the LDA. Inaccuracies in the Levenshtein distances reduce confidence that the OCR text strings have been mapped to the proper text strings in the coding dictionary and can lead to the assignment of improper codes, thus reducing the usefulness of the collected survey data.