Field of the Disclosure
The disclosure herein relates to systems, methods, and computer program products for de-duplication of data. Supervised machine learning methods, systems, and computer program products of data de-duplication are described.
Description of the Related Art
The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventor, to the extent it is described in the background section, as well as aspects of the description which may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art with respect to the present invention.
De-duplication of data, i.e., elimination of duplicate data, has become increasingly important with the exponential growth of available data. Data can be valuable when it is organized and streamlined. In contrast, data dumping without organization results in wasted resources looking for the useful data in an electronic warehouse of non-useful and/or duplicate data. This can result in multiple versions of the same data being retained or other data being completely dropped amongst the volume of data.
Attempts have been made in industry to merge data, wherein the data is gradually blended in stages to blur a distinction between the different groups of data. Methods and systems currently used include “blend,” “pick and choose,” and “list & stack.” Blend data is blended from a group of duplicates to form a single record. Pick and Choose data is a record picked from a list of duplicates. List and Stack data is a blended record, along with the individual duplicate records from which the blended record was built. Pick and Choose methodology compares the constituent data elements.
When there's a discrepancy, the Pick and Choose methodology defaults to one credit element, usually the most derogatory. Since the data selected may not be the most current or provide a fair representation of the consumer's credit history, this method slows or stops the credit evaluation process while the data is reviewed by lending and financial specialists. In contrast, List and Stack methodology lists all of the data received from the credit bureaus, leaving it to lending and financial specialists to manually sort and determine duplications and discrepancies. Since the List and Stack method relies on human interpretation, it can slow the evaluation process, produce a skewed result, and/or limit consistency in decision making. A blend of data on the other hand provides the most accurate data for each element, providing businesses with a more balanced and fair credit picture of the consumer. However, there are many different implementations of these, as well as other methods and systems of de-duplication, wherein each implementation produces different results.
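The three merge outputs described above can be contrasted with a minimal sketch. The function names, the record layout (one dictionary per duplicate record), and the caller-supplied `rank` and `choose_value` rules are assumptions made for illustration only; the description above does not specify how any particular implementation selects a record or an element value.

```python
def pick_and_choose(duplicates, rank):
    """Return one existing record from the duplicates, selected by a ranking rule."""
    return max(duplicates, key=rank)


def blend(duplicates, choose_value):
    """Build a single record by choosing a value for each element across duplicates."""
    fields = {name for record in duplicates for name in record}
    return {
        name: choose_value(name, [rec[name] for rec in duplicates if name in rec])
        for name in sorted(fields)
    }


def list_and_stack(duplicates, choose_value):
    """Return the blended record together with the records it was built from."""
    return {
        "blended": blend(duplicates, choose_value),
        "sources": list(duplicates),
    }
```

Under these assumptions, a Pick and Choose rule that defaults to the most derogatory record would be expressed through `rank`, while the quality of a blend depends entirely on the `choose_value` rule supplied.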
Multiple credit bureaus are an example in which data from each bureau is merged to form a single cohesive report. The first third-party credit reporting agencies were established in the early 1830s. These agencies eventually came to function much like modern-day franchises, with a national scope. These agencies are presently recognized as being dominated by three major credit bureaus: Equifax, Experian (formerly TRW), and TransUnion. These credit bureaus provide a credit report which is a record of credit activities for individuals. The credit reports list any credit-card accounts or loans that an individual may have, as well as balances and how regularly the individual makes payments. In addition, the credit reports indicate whether any action has been taken against the individual because of unpaid bills.
There are five major components to a conventional credit report. A first component includes personal identifying information, such as name, address (current and previous), social security number, and optionally telephone number, birth date, and current and previous employers.
A second component of the credit report includes credit history, which includes a section on bill paying history with banks, retail stores, finance companies, mortgage companies, and other entities that have granted credit to the individual. This credit history includes information about each account the individual has, such as when it was opened, what type of account it is, how much credit is granted on the account, and what is the monthly payment that is due. If the account is closed or the loan has been paid off, that information will be included in the report as well.
A third component of the credit report may include credit worthiness with regard to tax liens, court judgments, and bankruptcies. Other public records related to credit worthiness can be included.
A fourth component of the credit report may include report inquiries which list all credit grantors who have received a copy of the credit report within a specified timeframe, and other authorized entities that have viewed the credit report. In addition, the credit reporting system tracks the companies that have received the name and address in order to offer the individual a firm offer of credit.
A fifth component of the credit report may include consumer statements that, among other things, include disputes that the individual may have made regarding the report following reinvestigation. Both the consumer and the creditor may make statements on the report. Each of the three major credit bureaus gets its information because they serve as clearinghouses for credit information about customers.
In addition to the three national credit bureaus, there are several hundred to several thousand local and regional credit bureaus around the country that obtain information directly from the individual's creditors. These smaller local and regional bureaus are typically affiliated with one of the three national credit bureaus.
The activities of all credit bureaus are governed by the Fair Credit Reporting Act, which is a U.S. federal law (15 U.S.C. §1681) that regulates the collection, dissemination, and use of consumer credit information. The Fair Credit Reporting Act places strict requirements on the consumer credit reporting agencies. For example, the nationwide consumer credit reporting agencies must make available to consumers information about them contained in the agency's files, upon request and at no charge at least once per year. Also, if negative information is removed as a result of a consumer's dispute, it may not be reinserted (after verification) without notifying the consumer within five days in writing. Also, the consumer reporting agencies may not retain negative information for an excessive period. For example, the Fair Credit Reporting Act places restrictions on how long negative information, such as late payments, bankruptcies, liens, or judgments may stay on a consumer's credit report, which is typically seven years from the date of delinquency. Exceptions include bankruptcies, which may stay on the record for ten years, and tax liens, which may stay on the record seven years from the time they are paid.
Many products and services are provided on the basis of an agreement or expectation that the consumer will make one or more future payments. For example, in the case of a mortgage, auto loan, credit card, installment payment plan, or medical procedure, the consumer receives a product or service with the expectation that they will pay for it (possibly with interest) in the future. In the case of an auto lease or apartment rental, the consumer receives the use of a product, and the arrangement will only be beneficial to the provider of the product if the consumer pays for the use of the product for a minimum period of time. In such cases, the provider (“lender”) of the product or service would like to be able to ascertain the consumer's ability and likelihood to pay in the future. This need has led to the establishment of the above-described credit bureaus that collect data about consumers and furnish credit reports, credit attributes, and credit scores that lenders can use to decide whether to provide the product or service and terms of providing the product or service (risk-based pricing).
Service providers can also use credit information for account management, for example in deciding whether to increase the amount of a line of credit, to modify loan parameters, or to provide promotion/retention offers. Also, credit information is sometimes used in other areas such as employment and security clearance decisions, where credit risk is thought to be correlated with other behavioral risk.
FIG. 1 shows a schematic view of a conventional credit report 10. The credit report 10 among other things includes personal identifying information, such as name, address, social security number, telephone number, birth date, and present and previous employers. This information is presented as personal identifying information 11. The credit history 12 is also reported and includes a section regarding bill-paying history with banks, retail stores, etc. The credit history also indicates what entities have granted credit to the individual, and includes information about each account presently opened and formerly opened by the individual, and paid obligations for a specified timeframe. Public records 13 include information regarding credit worthiness, such as tax liens, court judgments, and bankruptcies. Report inquiries 14 include all credit grantors who have received a copy of the credit report 10 within a specified timeframe. It also includes any others who are authorized to view the credit report 10. Dispute statements 15 include a summary of statements made, such as disputing information on the credit report 10.
Credit attributes within a credit report from a credit bureau can be compared to credit attributes within a credit report of another credit bureau to create a blended credit report by de-duplicating data and performing a blend of credit data. Credit attributes include, but are not limited to trade lines, public records, inquiries, addresses, fraud reports, and directory items. One approach of de-duplicating a trade line includes evaluating the similarity based on matching a number of additional attributes, including an account number (AN1, AN2, AN3 . . . ), an account type (AT), the date the account opened (DO), data repository sources (DR, SR), high credit (HC), a lost or stolen indicator (LS), the bureau identity (ID), subscriber information (CN), the balance (BL), the payment amount (MP), and the credit limit (CL). Each individual attribute value is compared across two trade lines to arrive at a binary decision of similarity or dissimilarity. An index is assigned to each individual feature to arrive at a total index. A value of 0 is added to the total index when the individual feature does not match and a value of 1 is added to the total index when the individual feature matches. The table below illustrates individual indices and a total index for the attributes listed above.
Index Table
Attribute   Index   Value
AN1         2^14    16384
AN2         2^13    8192
AN3         2^12    4096
AT          2^11    2048
DO          2^10    1024
SR          2^9     512
HC          2^8     256
DR          2^7     128
LS          2^6     64
ID          2^5     32
SS          2^4     16
CN          2^3     8
BL          2^2     4
MP          2^1     2
CL          2^0     1
The total index is evaluated as the sum of the indices for each of the individual attributes. For the table illustrated above, the maximum total index=1+2+4+8+16+32+64+128+256+512+1024+2048+4096+8192+16384=32,767. A lookup is then performed in a lookup table to obtain a resultant indicator of a match or non-match based on the total index used as a displacement within the decision rule table.
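The index computation and lookup described above can be sketched as follows. This is an illustrative reading only: it assumes each attribute contributes its power-of-two index when the two trade lines match on that attribute and contributes nothing otherwise, that matching means exact equality, and that a missing value counts as a non-match. The names `total_index`, `is_match`, and `decision_table` are hypothetical.

```python
# Attribute order follows the index table, most significant first.
ATTRIBUTES = ["AN1", "AN2", "AN3", "AT", "DO", "SR", "HC", "DR",
              "LS", "ID", "SS", "CN", "BL", "MP", "CL"]

# Index of each attribute: AN1 -> 2^14, ..., CL -> 2^0.
WEIGHTS = {name: 1 << (len(ATTRIBUTES) - 1 - position)
           for position, name in enumerate(ATTRIBUTES)}


def total_index(line_a, line_b):
    """Sum the indices of every attribute on which the two trade lines match."""
    total = 0
    for name in ATTRIBUTES:
        value_a, value_b = line_a.get(name), line_b.get(name)
        # Assumption: exact equality; a missing value is a non-match.
        if value_a is not None and value_a == value_b:
            total += WEIGHTS[name]
    return total


def is_match(line_a, line_b, decision_table):
    """Use the total index as a displacement into the decision rule table."""
    return decision_table[total_index(line_a, line_b)]
```

Here `decision_table` would be a sequence of 32,768 match or non-match entries, one per possible total index from 0 through 32,767; its contents are not given in this description.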
In addition to the trade line comparisons and the index computations described above, there are multiple rule files to consider for collections, non-collections, and conditional rules, wherein a conditional rule can confirm or negate a previous decision. Newer features can also be implemented in code rather than in a decision matrix, which results in a modified merge index decision that is overridden with exception files. Also, changes can be made to the merge index decision in reaction to customer complaints, and such changes require a full release or regression.
As illustrated in the index table above and as per the contents of the decision matrix, there is a heavy reliance on an account number match. As an example, a collection trade line can require at least a 90% match of the account number. As a result, there may only be a 5% positive indication. In a non-collection trade line, a 30% match of the account number may be required, which can result in a 25% positive indication. However, a credit bureau mandate requires an account number suppression, which greatly affects a collection trade line by placing more emphasis on less reliable trade line comparisons.
Current approaches, which are based on a manual decision for a particular index within a decision table, result in only the most likely locations being updated with a positive match for the decision value. This arises because of the limited ability of a human to comprehend the entire address space of a 15-feature vector, which spans 2^15 = 32,768 possible total index values. The decision table is thus sparsely populated with positive match values.
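The scale of the problem can be illustrated with a short sketch. The helper `positive_fraction` and the three hand-set indices below are hypothetical and are chosen only to show how small the populated portion of a manually maintained table can be relative to its full address space.

```python
NUM_FEATURES = 15
TABLE_SIZE = 2 ** NUM_FEATURES  # one entry per possible total index


def positive_fraction(decision_table):
    """Return the fraction of decision-table entries marked as a match."""
    return sum(1 for entry in decision_table if entry) / len(decision_table)


# Hypothetical table in which only a few indices were set by hand.
decision_table = [False] * TABLE_SIZE
for index in (32767, 32766, 32765):  # illustrative values, not from any rule file
    decision_table[index] = True
```

With three positive entries out of 32,768, fewer than 0.01% of the table locations carry a match decision in this illustration.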