Test data can be of great value for testing data processing systems. In comparison to actual data drawn from specific domains, the truth value of the test data is known so that any errors in the processing of the data can be distinguished from errors in the data itself. As the sophistication of the data processing programs increases, the test data must also increase in sophistication to maintain realism and support the evaluation of complex processing procedures and algorithms that exploit contextual relationships and other expectations about the actual data.
For example, data capturing systems now use contextual data to improve the speed and accuracy with which information is acquired. Typically, data is acquired from hand-printed forms using optical character recognition (OCR) systems supplemented by human key entry systems. The OCR system begins either by trying to read an entire form field at once and comparing a provisional field answer to large dictionaries of possible outcomes or by segmenting the form field into separate characters and reassembling the characters into a provisional field answer. A preliminary confidence value is calculated that reflects the OCR system's assessment that it has the correct answer, e.g., the degree to which the hand-printed data matches recognized character or word forms. More sophisticated recognition systems use context-related information to make adjustments to this confidence value. For example, if a last name is read as “JOHNSON” on a form from a given household, and there are several other people in the same household whose names are read as “JOHNSSON”, then the recognition system may reduce the confidence value for the “JOHNSON” answer. As another example, if a person's first name is read as “Clara” and if a corresponding check-box question for the person's sex is read as “Male” instead of “Female”, then the confidence in the “Male” answer may be lowered. When all the pertinent context information has been utilized, the final confidence value is compared to a previously established “confidence threshold” to decide if the provisional answer in question will be “accepted” or “rejected”. If accepted, the field answer can be placed into the database without being seen by a human, but if rejected, field image information is shown to a human to key the correct answer from the image. The ability of the data capture systems to assign proper confidence values to field data being recognized is one of the keys to high quality data capture system performance.
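The accept/reject logic described above can be sketched in a few lines. This is only an illustrative sketch, not the method of any particular recognition system: the names `FieldRead`, `adjust_for_household`, and `decide`, the penalty of 0.2, the requirement of at least two disagreeing household members, and the threshold of 0.9 are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class FieldRead:
    value: str          # provisional field answer produced by the OCR system
    confidence: float   # preliminary confidence in [0, 1]


def adjust_for_household(read, other_surnames, penalty=0.2, min_disagreeing=2):
    """Lower the confidence of a surname read when several other household
    members' surnames were read differently (hypothetical context rule)."""
    disagreeing = sum(1 for s in other_surnames if s != read.value)
    if disagreeing >= min_disagreeing:
        return FieldRead(read.value, max(0.0, read.confidence - penalty))
    return read


def decide(read, threshold=0.9):
    """Compare the final confidence to the confidence threshold."""
    return "accepted" if read.confidence >= threshold else "rejected"


# A surname read as JOHNSON in a household where three others read as JOHNSSON.
read = FieldRead("JOHNSON", 0.95)
adjusted = adjust_for_household(read, ["JOHNSSON", "JOHNSSON", "JOHNSSON"])
print(decide(read), decide(adjusted))
```

In this sketch the unadjusted read would pass the threshold, while the context-adjusted read falls below it and would be routed to a human for key entry.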
Realistic test data for evaluating data capturing systems should not only be context-related within individual records (e.g., individual forms) but should also include controllable distributions of data among the records, including modeled errors. Such data allows the validity of assumptions to be assessed, criteria to be tuned, and logic and other rule forms to be tested both for efficacy and for functioning as intended.
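As a rough illustration of test data with a known truth value and a controllable error distribution, the sketch below generates the records of one household and injects a modeled error at a chosen rate. The function name `make_household`, the record fields, and the particular error model (doubling one letter of the surname) are assumptions made for this example only.

```python
import random


def make_household(rng, surnames, size, error_rate):
    """Generate records that share one true surname; each printed surname is
    corrupted with probability error_rate (hypothetical error model)."""
    truth = rng.choice(surnames)
    records = []
    for _ in range(size):
        printed = truth
        injected = rng.random() < error_rate
        if injected:
            # Modeled error: double one letter, e.g. JOHNSON -> JOHNSSON.
            i = rng.randrange(len(printed))
            printed = printed[:i] + printed[i] + printed[i:]
        records.append({"truth": truth, "printed": printed, "error": injected})
    return records


rng = random.Random(42)  # fixed seed so the generated test deck is repeatable
household = make_household(rng, ["JOHNSON", "GARCIA", "NGUYEN"], size=4, error_rate=0.25)
for record in household:
    print(record)
```

Because each record carries both the true value and the printed value, any disagreement in a capture system's output can be attributed either to the injected error or to the processing itself.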
Similarly, test data of increasing sophistication is necessary for more fully evaluating data processing systems that process domain-specific data, such as Census data, Internal Revenue Service data, financial transactions, and medical records. Such test data should not only model real-world data but should also be controllable in terms of real-world variables for (a) posing questions and monitoring the responsiveness of the processing systems to changing conditions or assumptions or (b) evaluating the fidelity of processing programs for carrying out complex rules or the efficacy of the rules themselves for achieving desired outcomes.