1. Field of the Invention
This invention relates to computer software and database record matching, and more specifically, to systems for using multiple reference files in cleansing, linking, and appending data elements of input records from a database to improve deliverability of the database, matching of data elements, and appending new data elements to the database.
2. Related Art
There are numerous businesses and governmental agencies that require current, up-to-date information on persons located throughout the country, and perhaps even the world. Most notably, mail carriers such as the United States Postal Service (USPS), as well as Federal Express and United Parcel Service, require current name and address information on persons in order to deliver mail efficiently. In addition, publishers and other media service providers, such as television and satellite broadcasters, internet service providers, and the like, require the same up-to-date contact information for their customers. Because of the vast number of entries needed to make such a database useful, such a database commonly contains millions of data records, and may contain hundreds of millions or even a billion data records.
Also, due to the competitive nature of these industries, many of the service providers require additional information contained in one or more reference files to be appended to each person's data record in the main, input database. For example, additional information may include, but is not limited to, demographic information, purchasing history, name change information, and the like. Thus, there is a need for a computer software system that efficiently manages a large input database and is capable of quickly and accurately updating data records in the input database, including cleansing existing data in the data records and appending new data to the data records according to matching data records in one or more reference files. Given the large size needed for such an input database, a system is needed that can process (either by cleansing or appending new data) about one million to about ten million data records per hour in order to make the input database and the accompanying software system usable to a service provider. In addition, a system is needed for efficiently processing individual data records of a database wherein the time for processing a single data record is important.
Today, conventional software systems directed to handling large input databases of this type perform only one operation at a time. For example, if an input database containing contact information for different persons requires each data record to be postal coded, processed against any USPS National Change of Address request, and appended with the corresponding telephone number, a conventional software system performs these three operations one at a time. As a result, each data record is read from the input file and written to an output file three different times, once for each required action or process. Because the input and output files are typically stored on remote storage devices, due to the vast size of such files, the extra reads and writes are exceedingly costly in both time and computer processor resources.
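The cost described above can be made concrete with a small sketch. It counts record reads and writes for a conventional one-operation-at-a-time run and, purely for contrast, for a run that applies every operation to a record before writing it. The three step functions, their lookup values, and the in-memory lists standing in for remote storage are all hypothetical and are not taken from any real system.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Step = Callable[[Record], Record]

# Hypothetical stand-ins for the three operations named in the text.
def postal_code(rec: Record) -> Record:
    rec["zip4"] = "0000"  # placeholder value, not a real postal lookup
    return rec

def change_of_address(rec: Record) -> Record:
    moves = {"1 Elm St": "2 Oak Ave"}  # toy change-of-address table
    rec["address"] = moves.get(rec["address"], rec["address"])
    return rec

def append_phone(rec: Record) -> Record:
    rec["phone"] = "555-0100"  # placeholder value
    return rec

def multi_pass(records: List[Record], steps: List[Step]) -> Tuple[List[Record], Dict[str, int]]:
    """One full read and write of the file per operation (conventional approach)."""
    io = {"reads": 0, "writes": 0}
    current = records
    for step in steps:
        out = []
        for rec in current:
            io["reads"] += 1           # record read from storage
            out.append(step(dict(rec)))
            io["writes"] += 1          # record written back to storage
        current = out
    return current, io

def single_pass(records: List[Record], steps: List[Step]) -> Tuple[List[Record], Dict[str, int]]:
    """Each record is read once, run through every operation, and written once."""
    io = {"reads": 0, "writes": 0}
    out = []
    for rec in records:
        io["reads"] += 1
        rec = dict(rec)
        for step in steps:
            rec = step(rec)
        out.append(rec)
        io["writes"] += 1
    return out, io

records = [{"name": "A", "address": "1 Elm St"}, {"name": "B", "address": "9 Ash Ln"}]
steps = [postal_code, change_of_address, append_phone]
multi_out, multi_io = multi_pass(records, steps)
single_out, single_io = single_pass(records, steps)
```

With R records and K operations, the first function performs R × K reads and R × K writes, while the second performs R of each, and both produce the same output records in this toy case.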
Therefore, there is a need for a system and method for recycling a data record of an input file back through processes that the data record has already been through when a later process changes a value of a data element in the data record. This concept of recycling each data record back through all of the prior processes before moving to the next step is not generally found in conventional systems. Conventional systems that do attempt to perform similar recycling functions require significantly increased setup time, machine resources, and elapsed time, if and when the client requests such recycling.
For example, in the case of a Change of Address (COA) process, the COA reference file may not chain moves that have been made in one case as an individual and in another case as a family. This means that if a data record in an input file was modified with a new address by a matching record in a COA reference file, that same data record may still have a more recent address change in a second COA reference file. It has been the general practice in the industry to separate out the records that received a new address from a COA reference file, then process these records back through all of the matching records in the COA reference file, and then merge the data records back into the original input file. This procedure is time-consuming, error-prone, and costly. The same problem is encountered with any single process or combination of processes when reprocessing of changed records is desired.
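The unchained-move problem can be illustrated with a minimal sketch. It assumes, as a deliberate simplification, that each COA reference file is a dictionary keyed by an exact address string; a real COA match would also involve the name and other elements. The function names, the `max_rounds` guard, and the sample addresses are hypothetical.

```python
from typing import Dict, List

def apply_once(address: str, coa_files: List[Dict[str, str]]) -> str:
    """Run the address through each COA file exactly once, in order."""
    for coa in coa_files:
        address = coa.get(address, address)
    return address

def apply_with_recycling(address: str, coa_files: List[Dict[str, str]], max_rounds: int = 10) -> str:
    """Re-run the address through every COA file until no file changes it."""
    seen = {address}  # addresses already visited, to avoid looping on circular moves
    for _ in range(max_rounds):
        changed = False
        for coa in coa_files:
            new_address = coa.get(address)
            if new_address is not None and new_address not in seen:
                address = new_address
                seen.add(new_address)
                changed = True
        if not changed:
            break
    return address

# Toy data: the first move was filed as an individual, the later one as a family.
family_moves = {"2 Oak Ave": "3 Pine Rd"}
individual_moves = {"1 Elm St": "2 Oak Ave"}
coa_files = [family_moves, individual_moves]

once = apply_once("1 Elm St", coa_files)
recycled = apply_with_recycling("1 Elm St", coa_files)
```

In this toy case the single run stops at the intermediate address because the family file was consulted before the individual move had been applied, whereas the recycled run reaches the most recent address.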
The industry typically uses some form of a weights-and-penalty matcher, sometimes with limited additional logic, such as requiring an exact match on house number or unit number. The result is a number of matching errors, either under-matches or over-matches. A large part of the cause of these errors is that this conventional approach reduces the match decision to a greater-than or less-than comparison with a scalar value. If every parsed element of the name and address, and of other identifying information to the extent available, such as part or all of the SSN, phone number, DOB, etc., were graded as being a match or not a match, then the decision becomes whether the candidate match is on one side or the other of a decision surface in an N+1 dimensional space, with N−1 more degrees of freedom in the decision making process.
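The difference between a scalar comparison and a decision over per-element grades can be sketched as follows. The field names, weights, penalty, threshold, and the rule in `vector_match` are all hypothetical and chosen only so that two candidates receive the same scalar score; they do not represent any particular matcher or the decision surface of any real system.

```python
from typing import Dict

# Hypothetical weights, penalty, and threshold for illustration only.
WEIGHTS = {"last_name": 3.0, "first_name": 2.0, "house_number": 2.0, "street": 2.0, "zip": 1.0}
PENALTY = 1.0
THRESHOLD = 7.0

def grade(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, bool]:
    """Grade each parsed element as a match (True) or not a match (False)."""
    return {field: a.get(field) == b.get(field) for field in WEIGHTS}

def scalar_score(grades: Dict[str, bool]) -> float:
    """Weights-and-penalty score: add the weight on a match, subtract a penalty otherwise."""
    return sum(WEIGHTS[f] if ok else -PENALTY for f, ok in grades.items())

def scalar_match(grades: Dict[str, bool]) -> bool:
    """Conventional decision: compare one number against a threshold."""
    return scalar_score(grades) >= THRESHOLD

def vector_match(grades: Dict[str, bool]) -> bool:
    """Illustrative decision over the whole grade vector rather than its sum."""
    return grades["last_name"] and grades["house_number"] and (grades["street"] or grades["zip"])

input_rec = {"last_name": "Smith", "first_name": "John", "house_number": "12", "street": "Main St", "zip": "10001"}
cand_a = dict(input_rec, house_number="14")  # differs only in house number
cand_b = dict(input_rec, first_name="Jon")   # differs only in first-name spelling

grades_a = grade(input_rec, cand_a)
grades_b = grade(input_rec, cand_b)
```

Both candidates collapse to the same scalar score and are therefore indistinguishable to the threshold test, while the grade vectors still record which element disagreed and so allow the two cases to be decided differently.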
U.S. Pat. No. 6,658,430 to Harman discloses a method and system for reformatting a text file such that a resulting output file can be easily manipulated, enhanced and postal coded. Although the Harman system provides a new method for reformatting an ASCII or similar text file, it is readily apparent that the Harman system does not address the process of cleansing the data records of an input file, nor does it improve the efficiency of such cleansing by reducing the number of reads and writes to remote storage or by recycling a record whose information has changed.
In addition to the above-discussed systems, none of the referenced patents, articles, sales materials, or World Wide Web pages describes a currently available prior art software system that processes each data record of an input file, such as a client record, through multiple processes in one pass. The industry norm today is to process the data records of an input file, such as a client file, against multiple reference files by processing all of the data records of the input file against one reference file at a time, resulting in two accesses of remote storage for each data record of the input file for each reference file. That is, each data record of the input file is read from remote storage, e.g., disk, processed against one reference file only, and then written back out to remote storage. This procedure is then repeated for the next reference file. As a result, such prior art systems require labor-intensive setup and take an exorbitant amount of wall clock time. Although several companies have developed and implemented graphical front ends for these prior art systems to facilitate the setup of such multifunction jobs, such graphical front ends do not address the multiple reads and writes to remote storage required for each data record of an input file for each reference file, nor do such prior art systems provide a means for recycling data records.