The invention is directed to a system and method for linking data that pertains to like entities. In particular, the invention is directed to a system and method for linking data pertaining to consumers, businesses, addresses, occupancies, and households using permanent, universally unique tokens.
Virtually all businesses today find it necessary to keep computerized databases containing information about their customers. Such information can be used in a variety of ways, such as for billing, and for keeping consumers informed as to sales and new products. This information is typically stored electronically as a series of records in a computer database, each record pertaining to a particular customer. Records are logical constructs that may be implemented in a computer database in any number of ways well known in the art. The database used may be flat, relational, or may take any one of several other known forms. Each record in the database may contain various fields, such as the customer's first name, last name, street address, city, state, and zip code. The records may also include more complex demographic data, such as the customer's marital status, estimated income, hobbies, or purchasing history.
Businesses generally gather customer data from a multitude of sources. These sources may be internal, such as customer purchases, or external, such as data provided by information service providers. A number of information service providers maintain large databases with broad-based consumer information that can be sold or leased to businesses; for example, a catalog-based retail business may purchase a list of potential customers in a specific geographic area.
Because businesses use varying methods to collect customer data, they often find themselves with several large but entirely independent databases that contain redundant information about their customers. These businesses have no means by which to accurately link all of the information concerning a particular customer. One common example of this problem is a bank that maintains a database for checking and savings account holders, a separate database for credit card holders, and a separate database for investment clients. Another common example is a large retailer that has separate databases supporting each of its divisions or business lines, which may include, for example, automotive repair, home improvement, traditional retail sales, e-commerce, and optometry services.
Businesses with multiple, independent databases may find it particularly valuable to know who among their customers come to them for multiple services. For example, a bank may wish to offer an enhanced suite of banking services to a customer that maintains only $100 in his or her savings account, if the bank could also determine that this same individual maintains a $100,000 brokerage account. This information could also be valuable, for example, to take advantage of cross-selling opportunities and to assist the business in optimizing the mix of services to best serve its existing customer base.
Linking all available data concerning each customer would also allow each of the business's divisions to have access to the most up-to-date information concerning each customer. For example, a customer may get married and relocate, then notify only one of the business's divisions concerning the change. Suppose that Sue Smith, residing in Memphis, becomes Sue Thompson, residing in Minneapolis. If only one of the business's data processing systems "knows" about the change, the other systems would be unable to determine that "new" customer Sue Thompson in Minneapolis is the same person as existing customer Sue Smith in Memphis.
One of the oldest methods used to combat this problem is simply to assign a number to every customer, and then perform matching, searching, and data manipulation operations using that number. Many companies that maintain large, internal customer databases have implemented this type of system. In theory, each customer number always stays the same for each customer, even when that customer changes his or her name or address. These numbers may be used internally, for example, for billing and for tracking packages shipped to that customer. The use of a customer identification number eliminates the ambiguities that could arise if, for example, the customer's name and address were instead used as identifiers. Financial institutions in particular have used personal identification numbers (PINs) to unambiguously identify the proper customer to which each transaction pertains.
Customer number systems are inherently limited to certain applications. Customer identification numbers are not intended to manage a constantly changing, nationwide, comprehensive list of names and addresses. Companies maintaining these numbers are generally only interested in keeping up with their own customers. Thus the assignment process for such numbers is quite simple: when a customer approaches the company seeking to do business, a new number is assigned to that customer. The customer numbers are not the result of a broad-based process capable of managing the address and name history for a given customer. Also, the customer numbers are assigned based only on information presented to the business creating the numbers. The numbers are not assigned from a multi-sourced data repository that functions independently of the company's day-to-day transactions. In short, the purpose of such numbers is simply transaction management, not universal data linkage. Such numbers are also not truly permanent, since they are typically retired by the company after a period of inactivity. Again, since the focus of the customer number assignment scheme is merely internal business transactions, there is no reason to permanently maintain a number for which no transactions are ongoing. These numbers cannot be used externally to link data because every company maintains a different set of customer numbers.
Although externally applied universal numbering systems have not been used for consumers, they have been made publicly available for use with retail products. The universal product code (UPC) system, popularly known as "bar codes," began in the early 1970s when a need was seen in the grocery industry for a coding system that was common to all manufacturers. Today, the Uniform Code Council, Inc. (UCC) is responsible for assigning all bar codes for use with retail products, thereby maintaining a unique UPC number for every product regardless of the manufacturer. A database of these codes is made publicly available so that the codes can be used by everyone. Using this database, every retailer can track price and other information about each product sitting on its shelves. Today's product distribution chains also rely heavily on the UPC system to track products and make determinations concerning logistics and distribution channels.
While the UPC system has been enormously successful, the system's usefulness is limited. To obtain a UPC number for a new product, a manufacturer first applies for a UPC number, the product and number are added to the UCC database, and then the manufacturer applies the proper bar coding to its products before they are distributed. There is no scheme for assigning UPC numbers to pre-existing products, and no scheme for matching UPC numbers to the products they represent. Also, since each UPC number represents a single, distinct item packaged for retail sale, there is no scheme for identifying the various elements of a particular product to which a single UPC number is assigned. The UPC system thus could not be used to link various pre-existing data pertaining to consumers and addresses.
A final but vitally important issue raised by the use of any identification number system with respect to individuals is privacy. A company's internal-only use of a customer identification number raises few privacy concerns. But the external use of a customer number or PIN with respect to an individual increases the risk that the individual's private data may be easily shared in an unauthorized or illegal manner. The potential for misuse thus makes customer number systems unacceptable solutions for an information service provider seeking to develop an externally-distributed linking system for data pertaining to the entire United States consumer population.
Given the limitations of identification number systems, the only comprehensive method to eliminate duplicates and link (or "integrate") customer data maintained on separate databases has historically been to rebuild the relevant databases from scratch. Since many such databases contain tens of millions of records, the cost of completely rebuilding the databases is often prohibitive. In addition, these databases are constantly in flux as old customers leave, new customers take their place, and customer information changes; thus the rebuild procedure must be periodically repeated to keep all information reasonably current.
Businesses have traditionally turned to information service providers for data integration and duplicate elimination services. The information services industry has devoted enormous resources in recent years to developing various "deduping" solutions. These solutions are performed after-the-fact, that is, after the instantiation of the duplicate entries within the data owner's system. To determine if data records for Sue Smith in Memphis and Sue Thompson in Minneapolis pertain to the same person, a deduping routine may analyze a myriad of data fields; simply comparing names and addresses will fail to achieve a match. Even in the case where the name and address are the same, this may not indicate that the records pertain to the same individual, since, for example, the data may pertain to a father and his namesake son. The fact that many databases contain largely incomplete data makes this problem even more difficult to solve, and in many cases makes a complete solution impossible.
Although deduping routines are necessarily complex, they must also be performed with great speed. These routines are used to dedupe databases having tens of millions of records. With such large databases, the software subroutine that performs the deduping function may be called millions of times during a single deduping session. Thus these subroutines must be executed on very fast, expensive computer equipment that has the necessary power to complete the deduping routine in a reasonable amount of time. Because duplicate elimination is so resource-intensive, such tasks are today performed only by information service providers or data owners that have access to the massive computing power necessary to efficiently perform these routines.
In addition, deduping routines necessarily involve some guesswork. As explained above, duplicate elimination is based on the available data, which may be incomplete. The results of duplicate elimination routines are thus only as good as the available information. Because of the inherent ambiguities in name and address information, no system can eliminate 100% of the duplicates in a customer database; inevitably, the resulting database will contain instances of multiple records for the same customer, and multiple customers merged into one record as if they were a single customer.
Historically, the procedure by which an information service provider integrates a business's databases has been time consuming and labor intensive. Since a wide variety of database formats are in use, the information service provider must first convert the database source files to a standard format for processing. The information service provider then runs one of the complex deduping programs as explained above. The data in the business's databases may be augmented with external sources of information to improve the accuracy of the deduping routines. The resulting database file is then reformatted into the business's database file format to complete the process. This entire procedure requires significant direct involvement by the information service provider's technical personnel, which is an important factor in the cost of the service.
A significant limitation of this data integration method is that each time the service is requested, the entire process must be repeated. Data integration cannot be performed for a single record at a time, or for only those records that have been updated. This is because the data integration process depends upon the comparison of all of the data records against each other to establish groupings of similar records. Although matching links are usually created during the comparison process, those links are temporary and are lost once the process is complete. The links must be recreated from scratch each time the service is performed. It would be impossible to reuse these links since they are not unique across the universe of all possible customers, and are not maintained by the information services provider.
One of the most significant limitations of the current data integration method is that it cannot be performed in real time; the process is only performed in batch mode. Real-time data integration would be highly desirable since it would allow a retailer or other data owner to provide an immediate, customized response to input for a particular customer. For example, when a particular customer visits a retailer's web site, it would be desirable to link all available information concerning that customer, and then display a web page that is particularly tailored to that customer's interests and needs. Another application would be to provide customized coupons or sales information in response to the "swiping" of a particular customer's credit card when a retail purchase is in progress.
Prior-art systems to provide a customized response to customer input are based on the matching of internal customer numbers. For example, some grocery stores distribute "member" cards containing bar codes to identify a particular customer. When the customer presents his or her member card at the check-out line, the card's bar code is scanned to determine the customer's identification number. The grocer's data processing system then automatically consults its buying history database in order to print coupons that are tailored to that customer's particular buying habits.
Record-at-a-time processing based on internal customer numbers has several important limitations. First, this system only works for established customers for whom a number has already been assigned. If a new customer enters the store, that customer must be issued a member card (and corresponding customer identification number) before the system will recognize the customer. Initially, the grocer would know nothing about this customer. In addition, this system's use of customer identification numbers would make it unacceptable for use externally, due to the individual privacy concerns discussed above.
Still another limitation of traditional data integration methods is that they provide no means by which a business can remotely and automatically update or "enhance" the data it maintains for each customer when the data concerning that customer changes. The traditional, batch-mode method of providing update or enhancement data is laborious, and may require several weeks from start to finish. First, the company requesting data enhancement is required to build an "extract file" containing an entry for each record in its customer database. This extract file is stored on a computer-readable medium, such as magnetic tape, which is then shipped to the information service provider for enhancement. Since a wide variety of database formats are in use, the information service provider must first convert the extract file to the information service provider's internal format for processing. Using this standardized version of the extract file, the information service provider then executes a software application that compares the information in the company's database against all of the information that the information service provider maintains. The update or enhancement data is then overlaid onto the company's standardized extract file.
An important limitation of this data update and enhancement method is that the business's database must be rebuilt even when it only requires an update to a small portion of the data. For example, a retailer may desire to update the addresses in its customer database once per month. Most customers will not have changed their address within each one-month period; the traditional update method, however, would require the retailer to completely rebuild the database to catch those few customers who have moved.
For all of these reasons, it would be desirable to develop an unambiguous data-linking system that will improve data integration, update, and enhancement; will perform record-at-a-time, real-time data linking; and may be used externally without raising privacy concerns.
The present invention is directed to a system and method for using permanent "tokens" to create an unambiguous linking scheme to match related data. Tokens may be implemented as unique numbers that are used to tag all data pertaining to a particular entity. These tokens are created by an information services provider, and may be distributed externally for the use of its customers. Unlike the customer identification numbers discussed above, the creation of tokens is not dependent upon a customer approaching the data owner. The information services provider that creates the tokens may maintain databases with information pertaining to the entire United States population, and constantly monitors the population for changes of address, name, status, and other demographic data in order to keep the list of tokens current. New tokens are assigned as new entities are identified.
To maintain the uniqueness of each token, the tokens are created only by a single central repository operated by the information services provider. Temporary tokens may be created initially when a new entity is encountered, so that the information services provider may collect additional data to confirm that the supposed new entity is not already in the database. Once the information services provider confirms that the entity is actually new, however, a permanent token will be assigned that will be used to link data pertaining to that entity for all time. Because even the information service provider's information will not be complete, it may be necessary to periodically perform token maintenance in the form of combining two or more tokens into a single token, or splitting a single token into two different tokens. This process may be performed simply by publishing a list of consolidated and split tokens that is transmitted to all token users. This maintenance method makes unnecessary the complete reprocessing of a database to keep tokens current.
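The maintenance step described above may be illustrated with a minimal Python sketch. The format of the published list is not specified here, so the `consolidated` mapping, the `split_tokens` set, and the `apply_maintenance` function are hypothetical names chosen for illustration. The sketch also assumes that records carrying a split token are merely flagged for re-linking, since the list alone may not say which record belongs to which new token.

```python
def apply_maintenance(records, consolidated, split_tokens):
    """Apply a published maintenance list to a data owner's records.

    records:      record key -> token currently tagged on that record
    consolidated: retired token -> surviving token (assumed list format)
    split_tokens: tokens that were split and must be re-linked (assumed)
    """
    updated = {}
    needs_relink = []
    for key, token in records.items():
        if token in split_tokens:
            # A split cannot be resolved locally; flag only this record.
            needs_relink.append(key)
            updated[key] = token
        else:
            # Consolidation is a simple substitution of one token for another.
            updated[key] = consolidated.get(token, token)
    return updated, needs_relink


records = {"rec-1": "T100", "rec-2": "T200", "rec-3": "T300"}
updated, needs_relink = apply_maintenance(
    records,
    consolidated={"T200": "T100"},
    split_tokens={"T300"},
)
```

Only the records named in the list are touched; every other record keeps its token, which is why a full reprocessing of the database is not required.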
Because the tokens are created at a central repository that is maintained by an information services provider, ambiguities may be resolved far more effectively than in prior art systems. The central repository may create an identification class that contains all available data pertaining to each entity for which information is maintained. The purpose of the identification class is to link all available data concerning a particular entity using the appropriate token. Even though much of this information may never be distributed, it may still be used in the matching process to assure that the correct token is assigned to a customer's data in response to a data integration, update, or enhancement request. The identification class may include name aliases, common name misspellings, last name change history, address history, street aliases, and other relevant information useful for matching purposes. The identification-class structure enables far more accurate matching and "deduping" than previously possible; for example, by using known name aliases, the central repository may recognize that a customer's separate database records for "Sue C. Smith," "Carol Smith," and "Sue Thompson" each actually refer to the same person, and would accurately assign a single token to link all relevant information about this person.
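A simplified Python sketch of matching against an identification class follows. It is an illustration under stated assumptions, not the repository's actual matching logic: the `IdentificationClass` fields, the `normalize` helper, and the rule that a match requires exactly one candidate are all hypothetical, and the address history is reduced to a set of cities for brevity.

```python
from dataclasses import dataclass, field


def normalize(text):
    """Lower-case, drop periods, and collapse whitespace (assumed rule)."""
    return " ".join(text.lower().replace(".", "").split())


@dataclass
class IdentificationClass:
    token: str
    names: set = field(default_factory=set)   # aliases, misspellings, former names
    cities: set = field(default_factory=set)  # address history, reduced to cities

    def matches(self, name, city):
        return normalize(name) in self.names and normalize(city) in self.cities


def assign_token(repository, name, city):
    """Return the token only when exactly one identification class matches."""
    hits = [entry.token for entry in repository if entry.matches(name, city)]
    return hits[0] if len(hits) == 1 else None


repository = [
    IdentificationClass(
        token="T100",
        names={"sue c smith", "carol smith", "sue thompson"},
        cities={"memphis", "minneapolis"},
    ),
]

tokens = [
    assign_token(repository, "Sue C. Smith", "Memphis"),
    assign_token(repository, "Carol Smith", "Memphis"),
    assign_token(repository, "Sue Thompson", "Minneapolis"),
]
```

All three customer records resolve to the same token because the alias and address history are held centrally, even though none of that history is returned to the data owner.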
Since the tokens are permanent and are universally unique, they are not limited to use by a particular data provider, or to a particular matching session; instead, the tokens are specifically intended for external distribution to any owner of relevant data, and will never expire. Once a data owner receives the tokens and matches them to its existing data, the tokens can be used to rapidly compare, match, search, and integrate data from multiple internal databases, either in batch mode or real time, using as few as one record at a time.
Different types of tokens may be used to link data relevant to, for example, customers, businesses, addresses, households, and occupancies. An occupancy token links information about a customer or business and the address at which that particular customer resides at a particular time. A household token links information about all persons who are determined to share a household. The definition of what constitutes a "household" may vary from one application to another; therefore, there may be multiple types of household tokens in use simultaneously. A series of linked address tokens can further be used to maintain an individual's address history. Using an address history, ambiguities caused by name similarity between individuals may be more easily resolved, and the correct token will be tagged to that individual's data despite a change in address.
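The relationship among these token types may be sketched in Python as follows. The `Occupancy` record layout, the year-based dates, and the household rule used here (everyone occupying the same address in a given year) are assumptions made for illustration; as noted above, other household definitions are possible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Occupancy:
    token: str                      # occupancy token
    consumer_token: str             # who resides there
    address_token: str              # where
    start_year: int                 # when the occupancy began
    end_year: Optional[int] = None  # None means the occupancy is current


def address_history(occupancies, consumer_token):
    """Address tokens for one consumer, ordered by start of occupancy."""
    own = [o for o in occupancies if o.consumer_token == consumer_token]
    return [o.address_token for o in sorted(own, key=lambda o: o.start_year)]


def household_members(occupancies, address_token, year):
    """One possible household rule: all occupants of an address in `year`."""
    return sorted(
        o.consumer_token
        for o in occupancies
        if o.address_token == address_token
        and o.start_year <= year
        and (o.end_year is None or year < o.end_year)
    )


occupancies = [
    Occupancy("O1", "C1", "A-MEMPHIS", 1995, 2000),
    Occupancy("O2", "C1", "A-MINNEAPOLIS", 2000),
    Occupancy("O3", "C2", "A-MINNEAPOLIS", 1998),
]
history = address_history(occupancies, "C1")
household = household_members(occupancies, "A-MINNEAPOLIS", 2001)
```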
As noted above, prior art "deduping" routines are complex, resource-intensive, and, because they are limited to the available data, cannot perform with 100% accuracy. With the present invention, however, adding new data to a data processing system is as simple as matching tokens against one another. Token matching is a computationally simple process that can be performed as the data is added to the data processing system in real time. Because no inadvertent duplicates are added to the database during data update or enhancement, periodic efforts to remove duplicates are unnecessary.
The present invention also uses tokens to greatly simplify the process of data integration where multiple databases are maintained. When all known information about a particular entity is required, the data owner need only search each database for information that is linked by the token associated with the entity of interest. There is no need to perform complex matching algorithms designed to determine whether, for example, two customers about whom information is maintained on separate databases are in fact the same individual. The tokens thus enable the data owner to treat each of its physically remote databases as if they were a single "virtual" database in which all information about a particular entity is readily accessible.
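The "virtual" database lookup may be sketched in Python as a keyed search across each database. This is a minimal illustration in which each database is modeled as an in-memory dictionary keyed by token; the database names, field names, and the `total_customer_view` function are hypothetical.

```python
def total_customer_view(token, databases):
    """Collect every record linked to `token` from each named database."""
    view = {}
    for name, table in databases.items():
        record = table.get(token)  # exact token lookup; no fuzzy matching
        if record is not None:
            view[name] = record
    return view


databases = {
    "savings": {"T100": {"balance": 100}},
    "brokerage": {"T100": {"balance": 100_000}, "T200": {"balance": 5_000}},
    "credit_card": {"T200": {"limit": 2_000}},
}
view = total_customer_view("T100", databases)
```

Because each lookup is an exact key match, the same function serves a batch job or a single real-time request.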
The use of tokens for linking data also significantly reduces the privacy concerns related to data enhancement, data integration, and related data processing. Once the appropriate tokens are matched to the data owner's data, update and enhancement requests may be transmitted to an information services provider as simply a list of tokens. The tokens themselves contain no information concerning the data to which they pertain. Thus anyone who clandestinely intercepts such a transmission would be unable to extract any private data from the transmission. In addition, since the tokens are merely data links, and not PINs or customer identification numbers, there is no increased individual-privacy risk associated with the external use of the tokens.
The tokens further allow real-time, record-at-a-time linking for the immediate collection of all relevant data in response to customer input. By collecting all data for a particular customer, the data owner is able to construct a "total customer view" that may be used, for example, to customize the interaction between the data owner and its customer. If multiple databases must be consulted to retrieve all relevant customer data, then each database need only be searched for data linked to the relevant token. The data owner can use the tokens to link all of its own data, or can link with data maintained by an information services provider to immediately enhance its data pertaining to a particular customer. Because the linking process is performed just at the moment when the customer input is received, the data retrieved will be the most recently updated customer information available. The linkage between the data owner's database and the information provider's database may be by OLTP (on-line transactional processing) using the linking tokens. This linkage may also be used to perform "trigger notification." Trigger notification is the automatic triggering of update messages to every linked database when new information is received about a particular entity. Using tokens, trigger notification may take place almost instantaneously, allowing, for example, every division of a large retailer to take advantage of the latest information received from a customer.
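Trigger notification may be sketched in Python as a simple publish/subscribe arrangement keyed by token. The `LinkedDatabase` and `TriggerHub` classes and their methods are hypothetical names, and the sketch assumes that a database applies an update only when it already holds the token in question.

```python
class LinkedDatabase:
    """A divisional database whose records are keyed by token."""

    def __init__(self, name):
        self.name = name
        self.records = {}  # token -> dict of fields

    def on_update(self, token, fields):
        # Matching is a single key lookup; unknown tokens are ignored.
        if token in self.records:
            self.records[token].update(fields)


class TriggerHub:
    """Forwards each update message to every linked database."""

    def __init__(self):
        self.subscribers = []

    def link(self, database):
        self.subscribers.append(database)

    def notify(self, token, fields):
        for database in self.subscribers:
            database.on_update(token, fields)


retail = LinkedDatabase("retail")
optometry = LinkedDatabase("optometry")
automotive = LinkedDatabase("automotive")
retail.records["T100"] = {"name": "Sue Smith", "city": "Memphis"}
optometry.records["T100"] = {"name": "Sue Smith"}
automotive.records["T999"] = {"name": "Another Customer"}

hub = TriggerHub()
for division in (retail, optometry, automotive):
    hub.link(division)

# One division learns of the change; every linked database is updated.
hub.notify("T100", {"name": "Sue Thompson", "city": "Minneapolis"})
```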
Another advantage of the record-at-a-time processing is that data may be "pushed" from the information services provider to its customers. For example, the information services provider may learn that a particular individual's name has changed. This change can be "pushed" to a customer's database automatically through the use of a message that contains the new information and the token used to link all data pertaining to this individual. Because the update process requires only the matching of tokens, the process may be performed automatically without direct intervention by either the information services provider or its customer.
One concern that arises in connection with an information service provider's external distribution of data is the inadvertent distribution of one company's data to that company's competitor. For example, company A may wish to link its data using tokens. The information service provider may already have information in its matching database about company A's customers that was obtained from company B, company A's competitor. The information services provider must be able to assure company B that its private data will not be distributed to company A. The use of tokens in the present invention, however, makes this "screening" process automatic. The information services provider may use the data of both companies as part of its internal token creation and linkage processes. But by returning only the information received from a company along with the linked tokens, the company receiving the tokens does not obtain anyone's data but its own. Because the tokens themselves reveal no private company information, there is no requirement to implement a separate "screening" function. Also, because the information service provider uses all available data to generate and link tokens, the correct tokens may still be distributed to companies with incomplete or partially inaccurate data.
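The automatic screening described above may be sketched in Python as follows. The `INTERNAL` matching table, its field names, and the `resolve_token` and `link_records` functions are hypothetical, and matching is reduced to an exact name-and-city lookup for brevity. The point illustrated is only that the response contains the submitted fields plus a token and nothing else.

```python
# Provider's internal matching data, built from every source available to it
# (here, an entry originally obtained from company B). Values are invented.
INTERNAL = {
    ("sue thompson", "minneapolis"): {
        "token": "T100",
        "source": "company B",
        "phone": "555-0100",
    },
}


def resolve_token(record):
    """Use all internal data to find the token; return None if unknown."""
    key = (record["name"].lower(), record["city"].lower())
    entry = INTERNAL.get(key)
    return entry["token"] if entry else None


def link_records(submitted):
    """Return each submitted record unchanged, with only a token added."""
    return [{**record, "token": resolve_token(record)} for record in submitted]


# Company A submits its own data and receives it back with tokens attached.
company_a_data = [{"name": "Sue Thompson", "city": "Minneapolis"}]
linked = link_records(company_a_data)
```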
It is therefore an object of the present invention to provide a data processing system using permanent tokens.
It is a further object of the present invention to provide a data processing system using tokens that are universally unique.
It is a still further object of the present invention to provide for the integration of data across multiple internal databases using tokens.
It is also an object of the present invention to provide for automatic duplicate elimination on a database using tokens.
It is another object of the present invention to provide for data update and enhancement using tokens.
It is still another object of the present invention to provide real-time, record-at-a-time processing of data using tokens.
It is still another object of the present invention to provide linkage capability for the creation of a total customer view from physically separate databases in real time using tokens.
It is still another object of the present invention to create a customized response to customer input in real time using tokens.
It is still another object of the present invention to perform trigger notification using tokens.
It is still another object of the present invention to automatically push update data from a central repository to a customer database using tokens.
Further objects and advantages of the present invention will be apparent from a consideration of the following detailed description of the preferred embodiments in conjunction with the appended drawings as briefly described following.