The invention is directed to a system and method for linking data that pertains to like entities. In particular, the invention is directed to a system and method for linking data where the links are encoded to discourage collusion by entities to whom the links are issued.
Virtually all businesses today find it necessary to keep computerized databases containing information about their customers. Such information can be used in a variety of ways, such as for billing, and for keeping consumers informed as to sales and new products. This information is typically stored electronically as a series of records in a computer database, each record pertaining to a particular customer. Records are logical constructs that may be implemented in a computer database in any number of ways well known in the art. The database used may be flat, relational, or may take any one of several other known forms. Each record in the database may contain various fields, such as the customer's first name, last name, street address, city, state, and zip code. The records may also include more complex demographic data, such as the customer's marital status, estimated income, hobbies, or purchasing history.
Businesses generally gather customer data from a multitude of sources. These sources may be internal, such as customer purchases, or external, such as data provided by information service providers. A number of information service providers maintain large databases with broad-based consumer information that can be sold or leased to businesses; for example, a catalog-based retail business may purchase a list of potential customers in a specific geographic area.
Because businesses use varying methods to collect customer data, they often find themselves with several large but entirely independent databases that contain redundant information about their customers. These businesses have no means by which to accurately link all of the information concerning a particular customer. One common example of this problem is a bank that maintains a database for checking and savings account holders, a separate database for credit card holders, and a separate database for investment clients. Another common example is a large retailer that has separate databases supporting each of its divisions or business lines, which may include, for example, automotive repair, home improvement, traditional retail sales, e-commerce, and optometry services.
Businesses with multiple, independent databases may find it particularly valuable to know who among their customers come to them for multiple services. For example, a bank may wish to offer an enhanced suite of banking services to a customer that maintains only $100 in his or her savings account, if the bank could also determine that this same individual maintains a $100,000 brokerage account. This information could also be valuable, for example, to take advantage of cross-selling opportunities and to assist the business in optimizing the mix of services to best serve its existing customer base.
Linking all available data concerning each customer would also allow each of the business's divisions to have access to the most up-to-date information concerning each customer. For example, a customer may get married and relocate, then notify only one of the business's divisions concerning the change. Suppose that Sue Smith, a long-time and valued customer residing in Memphis, becomes Sue Thompson, residing in Minneapolis. If only one of the business's data processing systems "knows" about the change, the other systems would be unable to determine that Sue Thompson in Minneapolis is the same person as Sue Smith in Memphis. This problem may prevent the business from treating a customer as befitting that customer's value to the business. Treating a long-time customer as if she were a new customer would likely be found insulting, and may even result in a loss of that customer's business.
One of the oldest methods used to combat this problem is simply to assign a number to every customer, and then perform matching, searching, and data manipulation operations using that number. Many companies that maintain large, internal customer databases have implemented this type of system. In theory, each customer number always stays the same for each customer, even when that customer changes his or her name or address. These numbers may be used internally, for example, for billing and for tracking packages shipped to that customer. The use of a customer identification number eliminates the potential ambiguities if, for example, the customer's name and address were instead used as identifiers. Financial institutions in particular have used personal identification numbers (PINs) to unambiguously identify the proper customer to which each transaction pertains.
Customer number systems are inherently limited to certain applications. Customer identification numbers are not intended to manage a constantly changing, nationwide, comprehensive list of names and addresses. Companies maintaining these numbers are generally only interested in keeping up with their own customers. Thus the assignment process for such numbers is quite simple: when a customer approaches the company seeking to do business, a new number is assigned to that customer. The customer numbers are not the result of a broad-based process capable of managing the address and name history for a given customer. Significantly, the customer numbers are assigned based only on information presented to the business creating the numbers by the customers themselves. The numbers are not assigned from a multi-sourced data repository that functions independently of the company's day-to-day transactions. In short, the purpose of such numbers is simply transaction management, not universal data linkage. Such numbers are also not truly persistent, since they are typically retired by the company after a period of inactivity. Again, since the focus of the customer number assignment scheme is merely internal business transactions, there is no reason to persistently maintain a number for which no transactions are ongoing. These numbers cannot be used externally to link data because every company maintains a different set of customer numbers.
Although externally applied universal numbering systems have not been used for consumers, they have been made publicly available for use with retail products. The universal product code (UPC) system, popularly known as "bar codes," began in the early 1970s when a need was seen in the grocery industry for a coding system that was common to all manufacturers. Today, the Uniform Code Council, Inc. (UCC) is responsible for assigning all bar codes for use with retail products, thereby maintaining a unique UPC number for every product regardless of the manufacturer. A database of these codes is made publicly available so that the codes can be used by everyone. Using this database, every retailer can track price and other information about each product sitting on its shelves. Today's product distribution chains also rely heavily on the UPC system to track products and make determinations concerning logistics and distribution channels.
While the UPC system has been enormously successful, the system's usefulness is limited. To obtain a UPC number for a new product, a manufacturer first applies for a UPC number, the product and number are added to the UCC database, and then the manufacturer applies the proper bar coding to its products before they are distributed. There is no scheme for assigning UPC numbers to pre-existing products, and no scheme for matching UPC numbers to the products they represent. Also, since each UPC number represents a single, distinct item packaged for retail sale, there is no scheme for identifying the various elements of a particular product to which a single UPC number is assigned. The UPC system thus could not be used to link various pre-existing data pertaining to consumers.
A final but vitally important issue raised by the use of any identification number system with respect to individuals is privacy. A company's internal-only use of a customer identification number raises few privacy concerns. But the external use of a customer number or PIN with respect to an individual increases the risk that the individual's private data may be used or shared with others in an unauthorized or illegal manner. This problem is of particular concern in the case of an information services provider that issues customer identification numbers or PINs as a means of tracking data on its clients' customer databases. Such companies typically have a large number of clients, many of whom may have substantially overlapping customer lists. These clients may wish to surreptitiously share their data with each other in order to gain more information about their own customers. Such clients might find it relatively simple to merge or otherwise use their customer databases in a collective manner based on simple matching of customer identification numbers. If clients were to collude in such an effort, the information services provider would no longer be able to control how the information is being used or shared. The information services provider's clients could thus circumvent whatever protections the information services provider might have put in place to prevent the misuse of personal information.
Another limitation of customer identification number systems is the method used to merge files and eliminate duplicate entries. The only comprehensive method to eliminate duplicates in such systems and link (or "integrate") customer data maintained on separate databases has historically been to rebuild the relevant databases from scratch. Since many such databases contain tens of millions of records, the cost of completely rebuilding the databases is often prohibitive. In addition, these databases are constantly in flux as old customers leave, new customers take their place, and customer information changes; thus the rebuild procedure must be periodically repeated to keep all information reasonably current.
Businesses have traditionally turned to information service providers for data integration and duplicate elimination services. The information services industry has devoted enormous resources in recent years to developing various "deduping" solutions. These solutions are performed after-the-fact, that is, after the instantiation of the duplicate entries within the data owner's system. To determine if data records for Sue Smith in Memphis and Sue Thompson in Minneapolis pertain to the same person, a deduping routine may analyze a myriad of data fields; simply comparing names and addresses will fail to achieve a match in many cases. Even in the case where the name and address are the same, this may not indicate that the records pertain to the same individual, since, for example, the data may pertain to a father and his namesake son. The fact that many databases contain largely incomplete or inaccurate data makes this problem even more difficult to effectively solve, and in many cases a complete solution is impossible.
Although deduping routines are necessarily complex, they must also be performed with great speed. These routines are used to dedupe databases having tens of millions of records. With such large databases, the software subroutine that performs the deduping function may be called millions of times during a single deduping session. Thus these subroutines must be executed on very fast, expensive computer equipment that has the necessary power to complete the deduping routine in a reasonable amount of time. Because duplicate elimination is so resource-intensive, such tasks are today performed only by information service providers or data owners that have access to the massive computing power necessary to efficiently perform these routines.
In addition, deduping routines necessarily involve some guesswork. As explained above, duplicate elimination is based on the available data, which may be incomplete. The results of duplicate elimination routines are thus only as good as the available information. Because of the inherent ambiguities in name and address information, no system can eliminate 100% of the duplicates in a customer database; inevitably, the resulting database will contain instances of multiple records for the same customer, and multiple customers merged into one record as if they were a single customer. A well-known result of this problem is the customer who receives several copies of the same catalog from a mail-order retailer. Such experiences are frustrating for the customer and result in increased costs for the retailer.
Historically, the procedure by which an information service provider integrates a business's databases has been time consuming and labor intensive. Since a wide variety of database formats are in use, the information service provider must first convert the database source files to a standard format for processing. The information service provider then runs one of the complex deduping programs as explained above. The data in the business's databases may be augmented with external sources of information to improve the accuracy of the deduping routines. The resulting database file is then reformatted into the business's database file format to complete the process. This entire procedure requires significant direct involvement by the information service provider's technical personnel, which is an important factor in the cost of the service.
A significant limitation of this data integration method is that each time the service is requested, the entire process must be repeated. Data integration cannot be performed for a single record at a time, or for only those records that have been updated. This is because the data integration process depends upon the comparison of all of the data records against each other to establish groupings of similar (and thus possibly duplicate) records. Although matching links are usually created during the comparison process, those links are temporary and are lost once the process is complete. The links must be recreated from scratch each time the service is performed. It would be impossible to reuse these links since they are not unique across the universe of all possible customers, and are not maintained by the information services provider.
One of the most significant limitations of the current data integration method is that it cannot be performed in real time; the process is only performed in batch mode. Real-time data integration would be highly desirable since it would allow a retailer or other data owner to provide an immediate, customized response to input from a particular customer. For example, when a particular customer visits a retailer's web site, it would be desirable to link all available information concerning that customer, and then display a web page that is particularly tailored to that customer's interests and needs. Another application would be to provide customized coupons or sales information in response to the "swiping" of a particular customer's credit card when a retail purchase is in progress.
Prior-art systems to provide a customized response to customer input are based on the matching of internal customer numbers. For example, some grocery stores distribute "member" cards containing bar codes to identify a particular customer. When the customer presents his or her member card at the check-out line, the card's bar code is scanned to determine the customer's identification number. The grocer's data processing system then automatically consults its buying history database in order to print coupons that are tailored to that customer's particular buying habits.
Record-at-a-time processing based on internal customer numbers has several important limitations. First, this system only works for established customers for whom a number has already been assigned. If a new customer enters the store, that customer must be issued a member card (and corresponding customer identification number) before the system will recognize the customer. Initially, the grocer would know nothing about this customer. In addition, this system's use of customer identification numbers would make it unacceptable for use externally, due to the individual privacy concerns discussed above.
Still another limitation of traditional data integration methods is that they provide no means by which a business can remotely and automatically update or "enhance" the data it maintains for each customer when the data concerning that customer changes. The traditional, batch-mode method of providing update or enhancement data is laborious, and may require several weeks from start to finish. First, the company requesting data enhancement is required to build an "extract file" containing an entry for each record in its customer database. This extract file is stored on a computer-readable medium, such as magnetic tape, which is then shipped to the information service provider for enhancement. Since a wide variety of database formats are in use, the information service provider must first convert the extract file to the information service provider's internal format for processing. Using this standardized version of the extract file, the information service provider then executes a software application that compares the information in the company's database against all of the information that the information service provider maintains. The update or enhancement data is then overlaid onto the company's standardized extract file.
An important limitation of this data update and enhancement method is that the business's database must be rebuilt even when it only requires an update to a small portion of the data. For example, a retailer may desire to update the addresses in its customer database once per month. Most customers will not have changed their address within each one-month period; the traditional update method, however, would require the retailer to completely rebuild the database to catch those few customers who have moved.
For all of these reasons, it would be desirable to develop an unambiguous data-linking system that will improve data integration, update, and enhancement; will perform record-at-a-time, real-time data linking; and may be used externally without raising privacy concerns.
The present invention is directed to a system and method for using persistent links to create an unambiguous linking scheme to match related data. Links may be implemented as unique alphanumeric strings that are used to tag all data pertaining to a particular entity. These links are created by an information services provider, and may be distributed externally for the use of its customers in an encoded form. Unlike the customer identification numbers discussed above, the creation of links is not dependent upon a customer approaching the data owner. The information services provider that creates the links may maintain databases with information pertaining to the entire population of a country or other area of interest, and constantly monitors the population for changes of address, name, status, and other demographic data in order to keep the list of links current. New links are assigned as new entities are identified.
The present invention is further directed to a method of encoding the links. Issuing the same links to a multitude of clients would allow for the possibility of the clients working cooperatively to share information amongst themselves without the involvement of the information services provider. According to the present invention, the links are encoded with a client-specific key before being issued externally. When the information services provider again accesses that particular client's data, the client-specific key will be used to decode the client's links. In this manner, internal processing by the information services provider may always be performed with unencoded links. This encoding scheme will make it difficult for clients to share information in an unauthorized manner. The encoding technique is chosen such that decoding would be a sufficiently difficult task as to render it commercially impractical. In addition, one embodiment of the invention comprises the use of multiple encoding algorithms to make unauthorized sharing of data even more difficult. In a preferred embodiment, the link is divided into various fields, such that only the portion of the field corresponding to a customer's identity need be encoded.
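By way of illustration only, the client-specific encoding described above might be sketched as follows. The source does not disclose a particular algorithm; this sketch assumes a small keyed Feistel permutation built on HMAC-SHA256, and a hypothetical link layout of a two-character type field followed by a 16-hex-digit identity field. The names `encode_link`, `decode_link`, and the constants are likewise assumptions made for the example. Only the identity field is transformed, in keeping with the preferred embodiment.

```python
import hashlib
import hmac

ROUNDS = 4                      # assumed round count for this sketch
HALF_BITS = 32                  # identity field is treated as a 64-bit integer
MASK = (1 << HALF_BITS) - 1
PREFIX_LEN = 2                  # hypothetical layout: 2-char type + 16 hex digits


def _round(key: bytes, rnd: int, half: int) -> int:
    """Keyed round function: HMAC-SHA256 output truncated to 32 bits."""
    msg = bytes([rnd]) + half.to_bytes(4, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def encode_link(link: str, client_key: bytes) -> str:
    """Encode only the identity field of a link with a client-specific key."""
    prefix, identity = link[:PREFIX_LEN], int(link[PREFIX_LEN:], 16)
    left, right = identity >> HALF_BITS, identity & MASK
    for rnd in range(ROUNDS):
        left, right = right, left ^ _round(client_key, rnd, right)
    return prefix + format((left << HALF_BITS) | right, "016x")


def decode_link(encoded: str, client_key: bytes) -> str:
    """Invert encode_link by running the Feistel rounds in reverse order."""
    prefix, value = encoded[:PREFIX_LEN], int(encoded[PREFIX_LEN:], 16)
    left, right = value >> HALF_BITS, value & MASK
    for rnd in reversed(range(ROUNDS)):
        left, right = right ^ _round(client_key, rnd, left), left
    return prefix + format((left << HALF_BITS) | right, "016x")


# The same internal link is issued to two clients under different keys.
internal_link = "IN00000000000004d2"
link_for_a = encode_link(internal_link, b"client-A-key")
link_for_b = encode_link(internal_link, b"client-B-key")
```

Under these assumptions, two clients holding `link_for_a` and `link_for_b` cannot match their records by comparing link values, while the provider, which holds both keys, can recover the same internal link from either.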
To maintain the uniqueness of each link, the links are created only by a single central repository operated by the information services provider and only used internally by this provider. Because even the information service provider's information will not be complete, it may be necessary to periodically perform link maintenance in the form of combining two or more links into a single link, or splitting a single link into two different links. This process may be performed simply by publishing a list of consolidated and split links that is transmitted to all link users. This maintenance method makes unnecessary the complete reprocessing of a database to keep links current.
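As a rough sketch of how a link user might apply such a published maintenance list, consider the following. The source does not specify the format of the list or how split links are resolved; here the consolidation list is assumed to be a mapping from retired links to surviving links, and records carrying a split link are assumed to be set aside for re-matching. The function name `apply_maintenance` and the record layout are hypothetical.

```python
from typing import Dict, List, Set


def apply_maintenance(records: List[dict],
                      consolidated: Dict[str, str],
                      split: Set[str]) -> List[dict]:
    """Update links in place; return the records that need re-matching."""
    needs_rematch = []
    for record in records:
        link = record["link"]
        if link in split:
            # Assumption: a split link is ambiguous until the record is re-matched.
            needs_rematch.append(record)
        elif link in consolidated:
            # The retired link is replaced by the surviving link.
            record["link"] = consolidated[link]
    return needs_rematch


customer_records = [
    {"name": "Sue Smith", "link": "L100"},
    {"name": "Sue Thompson", "link": "L200"},
    {"name": "John Doe", "link": "L300"},
]
pending = apply_maintenance(customer_records,
                            consolidated={"L200": "L100"},
                            split={"L300"})
```

Only records whose links appear on the list are touched, which is what makes a full reprocessing of the database unnecessary.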
Because the links are created at a central repository that is maintained by an information services provider, ambiguities may be resolved far more effectively than in prior art systems. The central repository may create an identification class that contains all available data pertaining to each entity for which information is maintained. The purpose of the identification class is to link all available data concerning a particular entity using the appropriate link. Even though much of this information may never be distributed, it may still be used in the matching process to assure that the correct link is assigned to a customer's data in response to a data integration, update, or enhancement request. The identification class may include name aliases, common name misspellings, last name change history, address history, street aliases, and other relevant information useful for matching purposes. The identification-class structure enables far more accurate matching and "deduping" than previously possible; for example, by using known name aliases, the central repository may recognize that a customer's separate database records for "Sue C. Smith," "Carol Smith," and "Sue Thompson" each actually refer to the same person, and would accurately assign a single link to link all relevant information about this person.
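A minimal sketch of an identification class, and of matching against it, might look like the following. The field names, the `normalize` helper, the link value `LNK0001`, and the rule that a record matches when both its name and its city appear in the class are all simplifying assumptions for illustration; an actual repository would presumably weigh many more fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


def normalize(text: str) -> str:
    """Lower-case, drop periods, and collapse whitespace."""
    return " ".join(text.lower().replace(".", "").split())


@dataclass
class IdentificationClass:
    link: str
    names: Set[str] = field(default_factory=set)    # aliases, former names
    cities: Set[str] = field(default_factory=set)   # simplified address history


def assign_link(name: str, city: str,
                repository: List[IdentificationClass]) -> Optional[str]:
    """Return the link of the first class that knows both the name and the city."""
    for entity in repository:
        if normalize(name) in entity.names and normalize(city) in entity.cities:
            return entity.link
    return None


repository = [
    IdentificationClass(
        link="LNK0001",
        names={normalize(n) for n in ("Sue C. Smith", "Carol Smith", "Sue Thompson")},
        cities={normalize(c) for c in ("Memphis", "Minneapolis")},
    )
]
```

With this repository, records for "Sue C. Smith" in Memphis and "Sue Thompson" in Minneapolis both resolve to the same link, while a name or city unknown to the class yields no match.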
Since the links are persistent and are universally unique within a domain, they are not limited to use by a particular data provider, or to a particular matching session; instead, the links are specifically intended for external distribution to any owner of relevant data, and will never expire. Once a data owner receives the links and matches them to its existing data, the links can be used to rapidly compare, match, search, and integrate data from multiple internal databases, either in batch mode or real time, using as few as one record at a time.
Different types of links may be used to link data relevant to, for example, individual customers, businesses, addresses, households, and occupancies. An occupancy link pertains to information about a customer or business and the address at which that particular customer resides at a particular time. A household link pertains to information about all persons who are determined to share a household. The definition of what constitutes a "household" may vary from one application to another; therefore, there may be multiple types of household links in use simultaneously. A series of linked address links can further be used to maintain an individual's address history. Using an address history, ambiguities caused by name similarity between individuals may be more easily resolved, and the correct link will be tagged to that individual's data despite a change in address.
As noted above, prior art "deduping" routines are complex, resource-intensive, and, because they are limited to the available data, cannot perform with 100% accuracy. With the present invention, however, adding new data to a data processing system is as simple as matching links against one another. Link matching is a computationally simple process that can be performed as the data is added to the data processing system in real time.
The present invention also uses links to greatly simplify the process of data integration where multiple databases are maintained. When all known information about a particular entity is required, the data owner need only search each database for information that is linked by the link associated with the entity of interest. There is no need to perform complex matching algorithms designed to determine whether, for example, two customers about whom information is maintained on separate databases are in fact the same individual. The links thus enable the data owner to treat each of its physically remote databases as if they were collectively a single "virtual" database in which all information about a particular entity is readily accessible.
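The lookup described above can be sketched as a simple search of each database by link. In this illustration each database is assumed to be an in-memory mapping from link to a list of rows; the database names, the link value, and the function `gather_by_link` are hypothetical.

```python
from typing import Dict, List

Database = Dict[str, List[dict]]   # link -> rows tagged with that link


def gather_by_link(link: str, databases: Dict[str, Database]) -> Dict[str, List[dict]]:
    """Collect every row tagged with the link, grouped by source database."""
    view = {}
    for name, database in databases.items():
        rows = database.get(link, [])
        if rows:
            view[name] = rows
    return view


bank_databases = {
    "savings": {"LNK0001": [{"balance": 100}]},
    "brokerage": {"LNK0001": [{"balance": 100000}]},
    "credit_card": {"LNK0002": [{"limit": 5000}]},
}
customer_view = gather_by_link("LNK0001", bank_databases)
```

No name or address comparison is involved; each database is consulted with an exact key lookup, which is why the separate databases can be treated as one virtual database.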
The use of links for linking data also significantly reduces the privacy concerns related to data enhancement, data integration, and related data processing. Once the appropriate links are matched to the data owner's data, update and enhancement requests may be transmitted to an information services provider as simply a list of links. The links themselves contain no information concerning the data to which they pertain. Thus anyone who clandestinely intercepts such a transmission would be unable to extract any private data from the transmission. In addition, since the links are merely data links, and not PINs or customer identification numbers, there is no increased individual-privacy risk associated with the external use of the links.
The links further allow real-time, record-at-a-time linking for the immediate collection of all relevant data in response to customer input. By collecting all data for a particular customer, the data owner is able to construct a "total customer view" that may be used, for example, to customize the interaction between the data owner and its customer. If multiple databases must be consulted to retrieve all relevant customer data, then each database need only be searched for data linked to the relevant link. The data owner can use the links to link all of its own data, or can link with data maintained by an information services provider to immediately enhance its data pertaining to a particular customer. Because the linking process is performed just at the moment when the customer input is received, the data retrieved will be the most recently updated customer information available. The linkage between the data owner's database and information provider's database may be by OLTP (on-line transactional processing) using the links. This linkage may also be used to perform "trigger notification." Trigger notification is the automatic triggering of update messages to every linked database when new information is received about a particular entity. Using links, trigger notification may take place almost instantaneously, allowing, for example, every division of a large retailer to take advantage of the latest information received from a customer.
Another advantage of the record-at-a-time processing is that data may be "pushed" from the information services provider to its customers. For example, the information services provider may learn that a particular individual's name has changed. This change can be "pushed" to a customer's database automatically through the use of a message that contains the new information and the link used to link all data pertaining to this individual. Because the update process requires only the matching of links, the process may be performed automatically without direct intervention by either the information services provider or its customer.
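One possible shape for such a pushed update is sketched below. The message format (a link plus a dictionary of changed fields), the function name `apply_push`, and the in-memory customer database are assumptions for illustration; the link in the message is assumed to already be in the form held by the receiving customer.

```python
from typing import Dict


def apply_push(message: dict, database: Dict[str, dict]) -> bool:
    """Apply a pushed update by link; return False if the link is not held."""
    record = database.get(message["link"])
    if record is None:
        return False               # this database has no data for the entity
    record.update(message["changes"])
    return True


customer_database = {
    "LNK0001": {"last_name": "Smith", "city": "Memphis"},
}
push_message = {
    "link": "LNK0001",
    "changes": {"last_name": "Thompson", "city": "Minneapolis"},
}
applied = apply_push(push_message, customer_database)
```

The receiving side performs only an exact link match before applying the change, so no name or address comparison, and no manual intervention, is required.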
One concern that arises in connection with an information service provider's external distribution of data is the inadvertent distribution of one company's data to that company's competitor. For example, company A may wish to have links applied to its data for one of many reasons. The information service provider may already have information in its matching database about company A's customers that was obtained from company B, company A's competitor. The information services provider must be able to assure company B that its private data will not be shared with company A. The use of links in the present invention makes this "screening" process automatic. The information services provider may use the data of both companies as part of its internal link creation and linkage processes. But by returning only the information received from a company along with the links, the company receiving the links does not obtain anyone's data but its own. Because the links themselves reveal no private company information, there is no requirement to implement a separate "screening" function. Also, because the information service provider uses all available data to generate and append the links, the correct links may still be distributed to companies with incomplete or partially inaccurate data.
It is therefore an object of the present invention to provide a data processing system using persistent links.
It is a further object of the present invention to provide a data processing system using links that are universally unique.
It is a still further object of the present invention to provide for the integration of data across multiple internal databases using links.
It is also an object of the present invention to provide for automatic duplicate elimination on a database using links.
It is another object of the present invention to provide for data update and enhancement using links.
It is still another object of the present invention to provide for the encoding of links with a client-specific key.
It is still another object of the present invention to provide a plurality of encoding algorithms for links provided to different clients.
It is still another object of the present invention to provide real-time, record-at-a-time processing of data using links.
It is still another object of the present invention to provide linkage capability for the creation of a total customer view from physically separate databases in real time using links.
It is still another object of the present invention to create a customized response to customer input in real time using links.
It is still another object of the present invention to perform trigger notification using links.
It is still another object of the present invention to automatically push update data from a central repository to a customer database using links.
Further objects and advantages of the present invention will be apparent from a consideration of the following detailed description of the preferred embodiments in conjunction with the appended drawings as briefly described following.