Organizations that rely on large amounts of data need that data to be of high quality. ‘Quality’ as it pertains to data refers to the extent to which data values exhibit characteristics such as accuracy, precision, completeness, integrity, and consistency. Low data quality can have negative practical effects on the organization, such as records being handled incorrectly, inaccurate data being provided to members of the organization, inefficient system operation, or system failures. For a business organization, such effects can quickly lead to customer dissatisfaction.
Organizations such as businesses typically hold such a large volume of data that it is not practical for human operators to evaluate its quality. For very large datasets, automated systems have therefore been developed to evaluate data quality and to identify and report on instances of low data quality. Such a system, referred to here as a data quality engine, can automatically measure data quality and help ensure that the data meets the needs of the organization. Corrective measures may then be taken to improve the data quality of a dataset so identified, such as by reprogramming the system that produces the dataset so that it generates higher-quality content.
A data quality engine may measure data quality for a dataset by examining values of data fields (also referred to simply as “fields”) of the dataset using predefined data quality rules. The data quality rules may define criteria for evaluating values of fields, such as by identifying characteristics (e.g., accuracy, precision, etc.) of the values according to the criteria. The extent to which the values exhibit these characteristics provides a measure of data quality for the fields. By evaluating the data quality rules for data fields, therefore, a data quality engine may automatically produce a measure of data quality. In some cases, the data quality engine may evaluate the data quality of a single record that comprises values for multiple data fields by evaluating data quality rules for one or more of the data field values in the record. In some cases, the data quality engine may evaluate the data quality of a dataset as a whole by combining data quality measures produced by evaluating data quality rules for each of the fields of the dataset.
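The rule-based evaluation described above can be illustrated with a minimal Python sketch. Everything in it is a hypothetical assumption made for illustration rather than a description of any particular engine: the names Rule, field_scores, record_score, and dataset_score, the sample records, and the three example rules are invented here. The sketch also assumes that a measure is the fraction of values passing a rule and that measures are combined by an unweighted mean; a real engine could weight rules or fields differently.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class Rule:
    """A hypothetical data quality rule: one criterion applied to one field."""
    field: str
    name: str
    check: Callable[[Any], bool]  # True when the value meets the criterion


def field_scores(records: List[Record], rules: List[Rule]) -> Dict[str, float]:
    """Per-field measure: mean pass rate over the rules defined for that field."""
    if not records:
        return {}
    rates: Dict[str, List[float]] = {}
    for rule in rules:
        passed = sum(1 for rec in records if rule.check(rec.get(rule.field)))
        rates.setdefault(rule.field, []).append(passed / len(records))
    return {name: sum(v) / len(v) for name, v in rates.items()}


def record_score(record: Record, rules: List[Rule]) -> float:
    """Single-record measure: fraction of rules the record's values satisfy."""
    results = [rule.check(record.get(rule.field)) for rule in rules]
    return sum(results) / len(results) if results else 1.0


def dataset_score(records: List[Record], rules: List[Rule]) -> float:
    """Dataset measure: unweighted mean of the per-field measures (an assumption)."""
    per_field = field_scores(records, rules)
    return sum(per_field.values()) / len(per_field) if per_field else 1.0


# Illustrative data and rules only; none of this comes from the source text.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 3, "email": "not-an-email", "age": 210},
    {"id": 4, "email": "d@example.com", "age": 41},
]

rules = [
    Rule("email", "completeness", lambda v: v is not None),
    Rule("email", "validity", lambda v: isinstance(v, str) and "@" in v),
    Rule("age", "plausible range", lambda v: isinstance(v, int) and 0 <= v <= 120),
]

print(field_scores(records, rules))     # per-field measures
print(record_score(records[2], rules))  # measure for a single record
print(dataset_score(records, rules))    # measure for the dataset as a whole
```

Keeping each rule as a small object with a field name and a predicate means new criteria can be added without changing the scoring functions, and the same rule set serves both the per-record and the whole-dataset evaluations.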