Field of the Invention (Technical Field)
This invention relates generally to associating data records and, in particular, to identifying data records that may contain information about the same entity so that these data records may be associated. Even more particularly, this invention relates to the standardization and comparison of attributes within data records.
Background
Today, the vast majority of businesses retain extensive amounts of data regarding various aspects of their operations, such as inventories, customers, and products. Data about entities, such as people, products, parts, or anything else, may be stored in digital format in a data store such as a computer database. These computer databases permit the data about an entity to be accessed rapidly and to be cross-referenced to other relevant data about the same entity. The databases also permit a person to query the database to find data records pertaining to a particular entity, such that data records from various data stores pertaining to the same entity may be associated with one another.
A data store, however, has several limitations that may hinder the ability to find the correct data about an entity within that data store. The data within the data store is only as accurate as the person who entered it or the original database from which it was drawn. Thus, a mistake in the entry of the data into the data store may cause a search for data about an entity to miss relevant data because, for example, a person's last name was misspelled, a social security number was entered incorrectly, or one or more attributes are missing. A whole host of such problems may be imagined. For example, separate records may be created for an entity that already has a record within the database, so that several data records contain information about the same entity, yet the names or identification numbers contained in those data records differ, making it difficult to associate the data records referring to the same entity with one another.
Several problems limit the ability to find all of the relevant data about an entity in such a database. For example, multiple data records may exist for a particular entity as a result of separate data records received from one or more information sources, which leads to a problem that can be called data fragmentation. In the case of data fragmentation, a query of the master database may not retrieve all of the relevant information about a particular entity. In addition, as described above, the query may miss some relevant information about an entity due to a typographical error made during data entry, which leads to the problem of data inaccessibility. Furthermore, a large database may contain data records that appear to be identical, such as a plurality of records for people with the last name of Smith and the first name of Jim. A query of the database will retrieve all of these data records, and the person who made the query may choose one of the retrieved data records at random, which may be the wrong data record. That person typically does not attempt to determine which of the records is appropriate. This can lead to the data records for the wrong entity being retrieved even when the correct data records are available. These problems limit the ability to locate the information for a particular entity within the database.
For multiple data stores, such as websites or apps each operating its own database containing a large number of data records, the ability to locate all relevant information about a particular entity within and among the respective databases is very important, but not easily obtained. For example, were one database to have the location of an entity such as a restaurant and another not, a user searching only the latter database would miss the location information. Also, as with a single database, any mistake in the entry of data, including without limitation the creation of more than one data record for the same entity at any information source, may cause relevant data to be missed when the data for a particular entity is searched for in the database. In addition, in cases involving multiple information sources, each of the information sources may have slightly different data syntax or formats, which may further complicate the process of finding data among the databases.
To reduce the amount of data that must be reviewed and prevent the user from picking the wrong data record, it is also desirable to identify and associate data records from the various information sources that may contain information about the same entity. This process is commonly referred to as “record linkage”.
The process of combining potentially incomplete information from multiple records in a single database is often referred to as “deduplication”. The association of multiple records from separate databases that may contain incomplete, complementary, overlapping, or conflicting information is referred to as “linkage”. Since these two processes are both essential parts of the work of creating the best combined view of information from multiple sources, they are both considered aspects of record linkage. Deduplication and linkage share the common problem of first assessing record similarity, also referred to as record distance. In the case of deduplication, matching records are either merged or deleted. In the case of linkage, either a link between the records or a merged master record is formed.
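As an illustration only, the merging of two matched records into a single master record can be sketched as follows. The function name `merge_records`, the attribute names, and the rule of keeping the first non-empty value are assumptions made for this sketch and are not taken from any particular system; the sketch also presumes that the two records have already been judged to refer to the same entity.

```python
from typing import Dict, Optional

# Hypothetical record shape: attribute name -> value, with None for a missing attribute.
Record = Dict[str, Optional[str]]


def merge_records(primary: Record, secondary: Record) -> Record:
    """Return a merged record, keeping values from `primary` and filling
    attributes that are missing or empty in it from `secondary`.

    Conflicting non-empty values are resolved in favor of `primary`;
    this is an assumed rule chosen only to keep the sketch short.
    """
    merged = dict(primary)
    for key, value in secondary.items():
        if not merged.get(key):  # absent, None, or empty string
            merged[key] = value
    return merged


record_a = {"name": "Joe's Pizza", "phone": None}
record_b = {"name": "Joes Pizza", "phone": "555-0100", "city": "Austin"}
master = merge_records(record_a, record_b)
```

In this sketch the master record keeps the name from the first record and takes the phone number and city from the second, which shows how complementary, incomplete records can yield a more complete combined view.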
There are conventional systems for record linkage that are capable of deduplication, but these systems only locate data records that are identical to each other or use a fixed set of rules to determine whether two records are identical. Thus, these conventional systems cannot determine whether two data records with, for example, slightly different last names nevertheless contain information about the same entity. Other conventional methods, such as phonetic methods and string distance methods, are designed to solve this problem. Another approach is to break data up into smaller units, or tokens, which can then be matched at a higher rate. While these methods are an improvement over linking only identical records, none of them is capable of perfect record linkage.
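By way of a hedged illustration, two of the conventional approaches mentioned above can be sketched in a few lines. The first function computes the Levenshtein edit distance, one common string distance method; the second compares two strings by the overlap of their whitespace-separated tokens using the Jaccard index. The function names and the choice of these two particular measures are assumptions made for this sketch, not a description of any specific conventional system.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def token_jaccard(a: str, b: str) -> float:
    """Share of distinct lowercase tokens that the two strings have in common."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


name_distance = levenshtein("Smith", "Smyth")
order_similarity = token_jaccard("Jim Smith", "Smith Jim")
```

Here the two spellings of the last name differ by a single substitution, and the token comparison treats "Jim Smith" and "Smith Jim" as the same set of tokens. Neither measure alone decides whether two records describe the same entity, which is consistent with the limitation noted above.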
There have also been efforts to standardize data. For example, the U.S. Postal Service has a standardized format for addresses that covers abbreviations, ordering of information, and use of special characters. If it were ever necessary to combine data in that format with another source of data concerning the same entities, the new set of data might not conform to the same standards. This would be an instance where better methods of record linkage would be useful.
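A minimal sketch of this kind of standardization, assuming a small hand-written abbreviation table, is given below. The table `ABBREVIATIONS` and the function `standardize_address` are hypothetical names introduced only for illustration; the table is a tiny illustrative subset and is not the official postal abbreviation list.

```python
import re

# Illustrative subset only; a real standard defines a far larger table.
ABBREVIATIONS = {
    "street": "ST",
    "avenue": "AVE",
    "boulevard": "BLVD",
    "suite": "STE",
    "north": "N",
}


def standardize_address(raw: str) -> str:
    """Strip punctuation, uppercase each token, and apply known abbreviations."""
    cleaned = re.sub(r"[^\w\s]", " ", raw)  # replace special characters with spaces
    tokens = cleaned.split()
    return " ".join(ABBREVIATIONS.get(t.lower(), t.upper()) for t in tokens)


form_a = standardize_address("123 North Main Street")
form_b = standardize_address("123 N. Main St.")
```

Under these assumptions both inputs reduce to the same standardized string, so a later comparison step sees them as equal. An address from a source that follows a different convention, or that uses a word missing from the table, would not be reconciled this way, which is the gap described above.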
People expect to enter information on a web page or app and get the desired results quickly and accurately. In many cases, this is possible, but there are situations in which this process could be enhanced dramatically with a system that deduplicates, links, and aggregates the data from multiple sources into a master database and presents more complete and accurate data to the user in a fashion not possible by searching a single website or app. For example, if an internet user wants to locate a restaurant to visit, the user has an array of search options, websites and apps, to choose from, but each of these options uses its own database containing its own rating system and body of data upon which results will be generated. Thus, a user could search two, three, four, or more different well-known sites for information about restaurants and get very different information in terms of, for example, price, location, and other subjective qualities that users of such sites or databases have provided. A user might also receive incomplete or incorrect information in a related attribute such as location or cost. Similarly, multiple sites might have different information to display about each restaurant, e.g., one site might display a star-based recommendation system, whereas another might not.
More generally, it would also be desirable for users to be able to search a single source for information pertaining to entities, such as restaurants, and obtain more reliable, extensive, and complete data records that are drawn from a plurality of information sources and pertain to the same entity, despite discrepancies between attributes of these data records.
From the above it is clear that there is a need for new methods of record linkage for data stores that solve the problems inherent in deduplication and linkage with respect to databases, particularly those relating to online websites or apps providing user ratings or other quality measures. It is to this end that embodiments of the present invention are directed.
There is also a need for improved techniques for comparing attributes during deduplication or record linkage as well as choosing the best attribute for use in a merge of linked records into a new database.