1. Field of the Invention
The present invention relates to the field of data management, and, more specifically, to selecting data to be corrected.
2. Description of the Prior Art
Incorrect data is a frequent and costly problem within modern data architecture. As the quantity of data that is processed and stored increases at an exponential rate, poor data quality will continue to become increasingly problematic. Incorrect data may result in tremendous profit losses for providers of goods and services. For example, if a mail order catalog provider has incorrect records of consumer addresses, the catalog provider will be unable to reach its target consumers. Incorrect data may be caused by such factors as duplicate data, non-standardized data, and data decay. For example, data corresponding to consumer addresses may be incorrect due to an error in transcribing the data. Furthermore, even if the data was transcribed correctly, because many consumers change addresses without notifying goods and services providers, such address data is often outdated.
Incorrect data often requires a great deal of time and expense to correct. One reason for this high cost is that data is rarely used by those who are able to both recognize incorrect data and provide the feedback necessary to correct it. For example, address data may be used by a marketing department to create a demographic profile of consumers. However, it is consumers, rather than the marketing department, who are best able to recognize incorrect address data and provide the feedback necessary to correct it. Unfortunately, such consumers are likely to view their address data only if that data is already correct, so that correspondence actually reaches them.
Another reason for the high cost of data correction is that data correction often requires multiple steps to identify incorrect data and to obtain feedback to correct such data. For example, to correct address data, an initial contact must be attempted to determine if the address is correct. This initial contact may often involve several attempts, as many consumers may not readily respond to correspondence sent to a correct address. Furthermore, a secondary contact is often required to obtain feedback to correct the data.
Due to the high cost of data correction, it is generally not feasible to correct a large volume of data in its entirety. Rather, when dealing with a large volume of data, it is generally more cost effective to select a portion of the data to correct. Typically, it is most cost effective to select data for correction that is low in quality and has a low correction cost. However, a determination of data quality and correction costs may often be inaccurate because it is based on anecdotal evidence rather than on actual error statistics. Specifically, a data steward or manager may determine that a portion of data is particularly low in quality if he or she has personally experienced or been notified of a large number of specific errors within the portion. Such errors may not be representative of the overall quality of the data. Furthermore, even if determinations of data quality and correction costs are accurate, tremendous costs may already have been incurred before a portion of data is determined to be particularly problematic.
Such error costs may be eliminated by pre-determining which portions of data are likely to cause errors before such errors actually occur. Such a pre-determination may be made based on data attributes such as, for example, the type, domain, structure, size and volume of the data. Thus, there is a need in the art for systems and methods for selecting data to be corrected.
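By way of illustration only, the selection criterion discussed above (favoring data that is low in quality and has a low correction cost) might be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the source: the names `Portion` and `select_for_correction`, the two-part cost model (a per-record check cost for the initial contact plus a per-error fix cost for the secondary contact), the greedy ranking, and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Portion:
    """A portion of data considered for correction (all fields are assumed inputs)."""
    name: str
    records: int
    error_rate: float   # measured fraction of incorrect records, 0.0 to 1.0
    check_cost: float   # hypothetical cost to verify one record (initial contact)
    fix_cost: float     # hypothetical cost to correct one incorrect record

    @property
    def expected_errors(self) -> float:
        return self.records * self.error_rate

    @property
    def total_cost(self) -> float:
        # Every record is checked; only incorrect records are then fixed.
        return self.records * self.check_cost + self.expected_errors * self.fix_cost

    @property
    def errors_per_unit_cost(self) -> float:
        # Guard against division by zero for a degenerate zero-cost portion.
        return self.expected_errors / self.total_cost if self.total_cost > 0 else 0.0


def select_for_correction(portions, budget):
    """Greedily pick whole portions, best errors-per-cost first, within budget."""
    selected = []
    remaining = budget
    ranked = sorted(portions, key=lambda p: p.errors_per_unit_cost, reverse=True)
    for p in ranked:
        if p.total_cost <= remaining:
            selected.append(p.name)
            remaining -= p.total_cost
    return selected


# Illustrative figures only; none of these values come from real data.
portions = [
    Portion("mail_order_addresses", 10_000, 0.20, 0.50, 2.00),
    Portion("billing_addresses", 5_000, 0.02, 0.50, 2.00),
    Portion("legacy_import", 2_000, 0.30, 3.00, 5.00),
]
chosen = select_for_correction(portions, budget=10_000.0)
```

In this sketch, a portion ranks higher when its error rate is high and its checking and fixing costs are low. The ranking is only as reliable as the error rates supplied to it, which is why rates drawn from actual error statistics, rather than from anecdotal evidence, matter to the outcome.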