1. Field of the Invention
The present invention relates to systems for processing data. More specifically, the present invention concerns systems for identifying duplicate data records and may be used to store such duplicate records in a data warehouse.
2. Discussion of the Prior Art
Modern businesses rely on computer systems to execute their various business processes. A typical business may operate several different computer systems, with each system addressing a particular need. Each of these systems generates disparate data that is valuable to the business. A data warehouse is commonly used to store, organize, manipulate and retrieve this data.
Generally, a data warehouse is a repository of current and historical data pertaining to any subject, entity, or other focus. Advantageously, a data warehouse may store different types of data from different types of systems in a manner that can be efficiently searched, retrieved and analyzed. A data warehouse operated by a business may, in this regard, receive and store data from several different legacy systems operated by outside vendors, from an external server of an operational system operated by the business, and from an internal server operated by the business. A data warehouse may be used in many different environments, with or without conventional elements such as data marts, archives, and other data warehouses.
Because data warehouses receive large amounts of data from different sources, the received data is often duplicative of other stored or received data. Problems may occur if the duplicative data is not recognized as such. For example, contact information may be stored in a data warehouse in association with a customer ID representing a particular customer. Subsequently, data may be received that includes contact information of the particular customer. The received data may differ from the stored data due to entry errors, changes in contact information (e.g., an address), or the like. However, for the purposes of the present description, the stored data and the received data are considered duplicative because they both represent contact information of the particular customer. Unless the received data is recognized as duplicative, the received data may be associated with a new customer ID and stored in the data warehouse in association therewith. Therefore, in a case that the stored data is used to generate advertising mailings, two sets of mailings and other communications would be sent to the particular customer. Moreover, customer behavior cannot be properly analyzed if actions of one customer are attributed to two or more customers represented by customer IDs maintained by the data warehouse. Of course, problems caused by duplicative data are not limited to contact information.
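By way of illustration only, the following minimal sketch shows how the problem described above may arise. It assumes a hypothetical ingestion routine, here named store_exact, that reuses an existing customer ID only when a received record matches a stored record exactly; the routine, the record layout, and the sample values are assumptions made for this example and do not describe any particular data warehouse or the present invention.

```python
from itertools import count


def store_exact(warehouse, next_id, record):
    """Hypothetical naive ingestion: reuse an ID only on an exact match."""
    for customer_id, stored in warehouse.items():
        if stored == record:
            return customer_id
    # No exact match was found, so the record is treated as a new customer.
    customer_id = next(next_id)
    warehouse[customer_id] = record
    return customer_id


warehouse = {}        # maps customer ID -> (name, address); illustrative only
next_id = count(1)    # source of new customer IDs

# Stored contact information for a particular customer.
first = store_exact(warehouse, next_id, ("Jane Smith", "12 Oak Street"))

# Later-received data for the same customer, differing only in how the
# address was entered.
second = store_exact(warehouse, next_id, ("Jane Smith", "12 Oak St."))

# The two IDs differ, so one customer is now represented twice.
print(first, second, len(warehouse))
```

In this sketch the same customer ends up with two customer IDs, so a mailing generated from the stored data would be sent twice and the customer's actions would be split across two IDs.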
Proportions of stored duplicative data, as well as the resulting problems, increase along with the volume of data stored in a data warehouse. In fact, currently operating data warehouses include up to 12% duplicate records. Of course, an average percentage of duplicate records may vary across businesses. Systems have therefore been developed that attempt to address the problem of duplicate records. These systems, such as Match IT™, DeDupe™ and TrueMatch™, purport to identify duplicate records; however, the present inventor has not found these systems to be satisfactorily efficient, effective, or compatible with existing data warehousing systems.