In today's day and age, the vast majority of businesses retain extensive amounts of data regarding various aspects of their operations, such as inventories, customers, products, etc. Data about entities, such as people, products, parts or anything else may be stored in digital format in a data store such as a computer database. These computer databases permit the data about an entity to be accessed rapidly and permit the data to be cross-referenced to other relevant pieces of data about the same entity. The databases also permit a person to query the database to find data records pertaining to a particular entity, such that data records from various data stores pertaining to the same entity may be associated with one another.
A data store, however, has several limitations which may limit the ability to find the correct data about an entity within the data store. The actual data within the data store is only as accurate as the person who entered the data, or an original data source. Thus, a mistake in the entry of the data into the data store may cause a search for data about an entity in the database to miss relevant data about the entity because, for example, a last name of a person was misspelled or a social security number was entered incorrectly, etc. A whole host of these types of problems may be imagined: two separate record for an entity that already has a record within the database may be created such that several data records may contain information about the same entity, but, for example, the names or identification numbers contained in the two data records may be different so that it may be difficult to associate the data records referring to the same entity with one other.
For a business that operates one or more data stores containing a large number of data records, the ability to locate relevant information about a particular entity within and among the respective databases is very important, but not easily obtained. Once again, any mistake in the entry of data (including without limitation the creation of more than one data record for the same entity) at any information source may cause relevant data to be missed when the data for a particular entity is searched for in the database. In addition, in cases involving multiple information sources, each of the information sources may have slightly different data syntax or formats which may further complicate the process of finding data among the databases. An example of the need to properly identify an entity referred to in a data record and to locate all data records relating to an entity in the health care field is one in which a number of different hospitals associated with a particular health care organization may have one or more information sources containing information about their patient, and a health care organization collects the information from each of the hospitals into a master database. It is necessary to link data records from all of the information sources pertaining to the same patient to enable searching for information for a particular patient in all of the hospital records.
There are several problems which limit the ability to find all of the relevant data about an entity in such a database. Multiple data records may exist for a particular entity as a result of separate data records received from one or more information sources, which leads to a problem that can be called data fragmentation. In the case of data fragmentation, a query of the master database may not retrieve all of the relevant information about a particular entity. In addition, as described above, the query may miss some relevant information about an entity due to a typographical error made during data entry, which leads to the problem of data inaccessibility. In addition, a large database may contain data records which appear to be identical, such as a plurality of records for people with the last name of Smith and the first name of Jim. A query of the database will retrieve all of these data records and a person who made the query to the database may often choose, at random, one of the data records retrieved which may be the wrong data record. The person may not often typically attempt to determine which of the records is appropriate. This can lead to the data records for the wrong entity being retrieved even when the correct data records are available. These problems limit the ability to locate the information for a particular entity within the database.
To reduce the amount of data that must be reviewed, and prevent the user from picking the wrong data record, it is also desirable to identify and associate data records from the various information sources that may contain information about the same entity. There are conventional systems that locate duplicate data records within a database and delete those duplicate data records, but these systems may only locate data records which are substantially identical to each other. Thus, these conventional systems cannot determine if two data records, with, for example, slightly different last names, nevertheless contain information about the same entity. In addition, these conventional systems do not attempt to index data records from a plurality of different information sources, locate data records within the one or more information sources containing information about the same entity, and link those data records together. Consequently, it would be desirable to be able to associate data records from a plurality of information sources which pertain to the same entity, despite discrepancies between attributes of these data records.
No matter the system utilized to identify and associate data records, however, certain conditions may arise with respect to associating these data records. More specifically, there will almost certainly be cases where data records which should be associated are not (known as false negative) and cases where data records are associated when they do not refer to the same entity (known as a false positive). In certain areas, these conditions may be relatively innocuous, the false negatives and false positives are easily dealt with and no harm may arise. In highly critical areas such as health care industries, however, these conditions may have the potential to cause great harm. This is particularly true for false positives. Mistakenly associating data records which refer to distinct entities may have large ramifications when it comes to the application of medical care and pharmaceuticals.
Thus, there is a need for system and methods for comparing attributes of data records and linking these data records which is operable to filter these linked data records for false positives, and it is to this end that embodiments of the present invention are directed.