Distributed databases are a significant feature of many modern network applications, including DNS resolution, user validation, financial support, and myriad business activities. Details of exemplary database structures, including update operations, that may be suited for validation according to the subject matter discussed herein may be found, for example, in U.S. Pat. No. 6,681,228, Jan. 20, 2004, U.S. Pat. No. 7,047,258, May 16, 2006, U.S. Pat. No. 7,167,877, Jan. 23, 2007, and U.S. Pat. No. 7,203,682, Apr. 10, 2007, the contents of which are incorporated herein by reference in their entireties.
One issue common to distributed database applications is the need for data integrity among the various copies of the database, particularly in light of updates and the like that cause changes to the data. This issue often involves a balance of many factors, including the nature and volume of the data, the number and function of copies of the database, and the rates at which the data is changed and accessed.
Data scrubbing, also known as data cleansing, is the process of identifying, amending, and/or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated. Scrubbing can also be used proactively to help identify when software or other problems may be contributing to errors in the database, for example in a distributed database with replication problems. Organizations in data-intensive fields, such as online services, banking, insurance, retailing, and telecommunications, may use data scrubbing tools to examine data in one or more databases for flaws by using rules, algorithms, and look-up tables. Using automated data scrubbing tools has become increasingly necessary as the size and complexity of databases have expanded beyond the realm where manual analysis, review, and correction by database administrators are practicable.
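By way of illustration only, a rule-based scrubbing pass of the kind described above may be sketched as follows. The record layout (`name`, `type`, `value`), the look-up table `ALLOWED_TYPES`, the simplified `NAME_PATTERN`, and the function `scrub` are hypothetical and are not drawn from any particular scrubbing tool; the sketch merely shows how rules and a look-up table might flag incomplete, improperly formatted, incorrect, and duplicated records.

```python
import re

# Hypothetical look-up table of permitted record types (illustrative only).
ALLOWED_TYPES = {"A", "AAAA", "NS", "MX"}

# Simplified hostname rule, assumed for this sketch; real validation is stricter.
NAME_PATTERN = re.compile(r"^(?=.{1,253}$)([a-z0-9-]{1,63}\.)+[a-z]{2,63}$")


def scrub(records):
    """Return (clean_records, flaws) for a list of dict records.

    Each flaw is a tuple of (index of the record in the input, reason).
    """
    clean, flaws, seen = [], [], set()
    for index, record in enumerate(records):
        # Normalize fields so that trivially different copies compare equal.
        name = (record.get("name") or "").strip().lower()
        rtype = (record.get("type") or "").strip().upper()
        value = (record.get("value") or "").strip()

        if not name or not rtype or not value:
            flaws.append((index, "incomplete"))
        elif not NAME_PATTERN.match(name):
            flaws.append((index, "improperly formatted"))
        elif rtype not in ALLOWED_TYPES:
            flaws.append((index, "incorrect"))
        elif (name, rtype, value) in seen:
            flaws.append((index, "duplicated"))
        else:
            seen.add((name, rtype, value))
            clean.append({"name": name, "type": rtype, "value": value})
    return clean, flaws
```

In this sketch a record is checked against each rule in turn and only the first failing rule is reported; a production tool might instead report every violated rule, or amend a record rather than exclude it.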
Without timely and accurate data cleansing, various problems can ensue including, for example, merging corrupt or incomplete data from multiple databases in distributed database structures and providing inaccurate data to requesting users. These problems are particularly acute in systems where improperly functioning or corrupted software, or other operational factors, can multiply a single error into thousands or millions of pieces of erroneous, duplicated, or inconsistent data. Additionally, many applications have particular concerns with respect to integrity, performance, and recovery of data in the database. For example, in systems that are relied on to provide critical support services for the Internet, concerns such as the reliable availability of the service, the ability to improve access through geographic distribution of operational components, and data backup and recovery are all critical. As for throughput and availability, in the context of DNS resolution, sites for one registry may answer over thirty-four billion queries a day. Therefore, it is important to maintain a thorough and timely assessment of the data in a database, particularly when large-scale databases including distributed replication, or data warehouses that merge information from many different sources, are being used in high-accessibility, high-volume applications such as DNS resolution and the like. However, depending on parameters such as the size of the database and the required accessibility to the database, it can be difficult to maintain a thorough and up-to-date assessment of the data in a database without resorting to measures that detract from the overall efficiency of the database, for example, taking certain copies of the database offline or applying read/write locks.