1. Technical Field
The disclosure relates generally to data processing, data mining, and knowledge discovery.
2. Description of Related Art
Along with the revolutionary advancements in commercial and private enterprises brought about by the introduction of the personal computer have come new problems. Particularly, with respect to the Internet, both electronic commercial exchanges, also now known as “E-commerce,” and direct business-to-business electronic data processing have led to decreasing quality control with respect to data records received from other parties. In traditional systems, only a select few of a company's employees had authority to enter data directly into an established database, in accordance with rules generally designed to optimize data integrity. Now, in order to speed processes, remote access to a database may be granted to a plurality of persons or entities, e.g., clients, customers, vendors, and the like, who may be using a plurality of different software programs or who may simply ignore the requirements intended by the associated enterprise receiving data and maintaining the database. As a result, the database may contain duplicative and erroneous data which must be “cleaned.” “Data cleaning” and “data clean-up” are the terms of art generally used to refer to the handling of missing data or the identification of data integrity violations, where “dirty data” is a term generally applied to input data records, or to particular data fields in the string of data comprising a full data record, which may have anomalies in that they may not conform to an expected format, standard, or content for the established database.
Many companies need to analyze their business transaction records or activity records either to create a database or to match each record against an existing database of their customers, clients, employees, or the like. For example, consider a data-intensive commercial enterprise such as processing credit card transactions. Each transaction may comprise an electronic digital data packet in which a data string is broken into predetermined fields, wherein each field may contain specific information; e.g., each packet might contain: <name, telephone number, postal code, credit card number, transaction amount>. On a worldwide basis, millions of transactions can be logged in a single twenty-four hour period for the card processor to receive, store, and process. Many different types of data errors may be introduced in each transaction. For example, one regular complication arises where the merchant-identifying data field for the transaction record is polluted with information specific to the individual transaction. As examples, consider a data set of transactions where an intended “authorized merchant name” field indicates not only the name, but also additional, variable information added by the merchants:
EBAY #234983498, EBAY #392385753, EBAY # . . . , where the Internet auction web site commonly referred to as “EBAY” has entered both its name and a specific on-line auction item identification number;
UNITED AIRLINES #387394578, UNITED AIRLINES #948693842, UNITED AIRLINES # . . . , where UNITED has entered both its name and a specific passenger ticket number; and
MACY'S WOMAN'S CLOTHING, MACY'S TOYS, MACY'S . . . , where one or more stores known as MACY'S have entered both the store name and a specific sales department of the store, and where such departments may vary from store to store.
The credit card processor is looking for a distinct “name,” and while each “name” field is distinct, there may be three or more authorized merchants, EBAY, UNITED AIRLINES, MACY'S, for the processor to sort out. Consider further the example of the chain stores “WALMART,” for which the daily log of transactions may contain references such as: WALMART #239823, WALMART #234894, WALMART #459843, and WALMART #958384, where each WALMART enters both its name and a specific store number, e.g., #239823 being in Palo Alto, Calif., #234894 being in Mt. View, Calif., and both #459843 and #958384 being in Cupertino, adding a potential complication wherein two different store locations may reside in the same city under the same U.S. zip code.
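By way of illustration only, the following Python sketch shows why exact matching on the polluted merchant-name field overcounts merchants, and why a single hand-written rule does not fully repair it. The field values are those given in the examples above; the function name `strip_trailing_id` and the rule it applies (dropping a trailing “#” followed by digits) are assumptions introduced for this illustration and are not a method described in this disclosure.

```python
import re
from collections import Counter

# Merchant-name field values taken from the examples in the text.
merchant_fields = [
    "EBAY #234983498",
    "EBAY #392385753",
    "UNITED AIRLINES #387394578",
    "UNITED AIRLINES #948693842",
    "MACY'S WOMAN'S CLOTHING",
    "MACY'S TOYS",
    "WALMART #239823",
    "WALMART #234894",
    "WALMART #459843",
    "WALMART #958384",
]

# Hypothetical hand-written rule: drop a trailing "#<digits>" suffix.
TRAILING_ID = re.compile(r"\s*#\s*\d+\s*$")


def strip_trailing_id(field: str) -> str:
    """Return the field with any trailing '#<digits>' suffix removed."""
    return TRAILING_ID.sub("", field)


# Exact matching treats every polluted value as its own merchant.
exact_counts = Counter(merchant_fields)

# The rule merges the EBAY, UNITED AIRLINES, and WALMART variants, but
# leaves the two MACY'S department strings as separate "merchants".
stripped_counts = Counter(strip_trailing_id(f) for f in merchant_fields)

print(len(exact_counts), "distinct values by exact match")
print(len(stripped_counts), "distinct values after the suffix rule")
```

The suffix rule happens to fit three of the four merchants in this sample; the MACY'S entries show that each new pollution pattern would need its own manually written rule.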
From this example of a credit card processor, it can be recognized that, for enterprises which have a broad installed base from which extensive input data is regularly received, storing each individual activity without cleaning dirty data and eliminating unnecessary duplication of information may lead to extensive and generally expensive hardware requirements in terms of data storage and data processing resources. Perhaps more importantly, dirty data degrades the quality of data analysis processes. Moreover, while it can be determined intuitively that certain dirty data may allow a many-to-one mapping, doing so is a slow, manual labor task; e.g., one can study a log of transactions and come to realize that every transaction for “EBAY . . . ” is always related to data representative of the city “Palo Alto” and the state “CA,” and that therefore all such transaction records can be assigned to a single file of the database for that store, the variable suffix likely being a transaction number given out by the EBAY corporation.
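The manual inspection just described can be pictured with a short Python sketch, offered only as an illustration under stated assumptions. The records pair the merchant fields and locations mentioned in the examples above; the function names and the choice to take the text before “ #” as the candidate merchant name are hypothetical and are not the technique of this disclosure.

```python
from collections import defaultdict

# Illustrative records: (merchant field, city, state), using the
# merchants and locations named in the examples in the text.
records = [
    ("EBAY #234983498", "Palo Alto", "CA"),
    ("EBAY #392385753", "Palo Alto", "CA"),
    ("WALMART #239823", "Palo Alto", "CA"),
    ("WALMART #234894", "Mt. View", "CA"),
    ("WALMART #459843", "Cupertino", "CA"),
    ("WALMART #958384", "Cupertino", "CA"),
]


def locations_by_prefix(rows):
    """Collect the set of (city, state) pairs seen for each name prefix."""
    seen = defaultdict(set)
    for merchant, city, state in rows:
        # Assumption: the candidate name is the text before " #".
        prefix = merchant.split(" #", 1)[0]
        seen[prefix].add((city, state))
    return dict(seen)


def single_location_prefixes(rows):
    """Return prefixes whose records all share one (city, state) pair."""
    return sorted(
        prefix
        for prefix, locations in locations_by_prefix(rows).items()
        if len(locations) == 1
    )


# Prefixes for which a many-to-one mapping looks plausible by location.
print(single_location_prefixes(records))
```

In this sample only the EBAY prefix is tied to a single city and state, mirroring the observation in the text, while the WALMART prefix spans several locations and would need further examination.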
It would be advantageous, among other things, to build and maintain databases in which data is cleaned and duplicative data is consolidated automatically.