Data about entities, such as people, products, or parts, may be stored in digital format in a computer database. These computer databases permit the data about an entity to be accessed rapidly and permit the data to be cross-referenced to other relevant pieces of data about the same entity. The databases also permit a person to query the database to find data records pertaining to a particular entity. The terms data set, data file, and data source may also refer to a database. A database, however, has several limitations which may limit the ability of a person to find the correct data about an entity within the database. The actual data within the database is only as accurate as the person who entered the data. Thus, a mistake in the entry of the data into the database may cause a person looking for data about an entity in the database to miss some relevant data about the entity because, for example, a last name of a person was misspelled. Another kind of mistake involves creating a new separate record for an entity that already has a record within the database. In a third problem, several data records may contain information about the same entity, but, for example, the names or identification numbers contained in those data records may differ, so that the database may not be able to associate the data records with each other.
For a business that operates one or more databases containing a large number of data records, the ability to locate relevant information about a particular entity within and among the respective databases is very important, but not easily obtained. Once again, any mistake in the entry of data (including without limitation the creation of more than one data record for the same entity) at any information source may cause relevant data to be missed when the data for a particular entity is searched for in the database. In addition, in cases involving multiple information sources, each of the information sources may have slightly different data syntax or formats which may further complicate the process of finding data among the databases. An example of the need to properly identify an entity referred to in a data record, and to locate all data records relating to that entity, arises in the health care field, where a number of different hospitals associated with a particular health care organization may each have one or more information sources containing information about their patients, and the health care organization collects the information from each of the hospitals into a master database. It is necessary to link data records from all of the information sources pertaining to the same patient to enable searching for information for a particular patient in all of the hospital records.
There are several problems which limit the ability to find all of the relevant data about an entity in such a database. Multiple data records may exist for a particular entity as a result of separate data records received from one or more information sources, which leads to a problem that can be called data fragmentation. In the case of data fragmentation, a query of the master database may not retrieve all of the relevant information about a particular entity. In addition, as described above, the query may miss some relevant information about an entity due to a typographical error made during data entry, which leads to the problem of data inaccessibility. In addition, a large database may contain data records which appear to be identical, such as a plurality of records for people with the last name of Smith and the first name of Jim. A query of the database will retrieve all of these data records, and the person who made the query may often choose, at random, one of the data records retrieved, which may be the wrong data record. The person often may not attempt to determine which of the records is appropriate. This can lead to the data records for the wrong entity being retrieved even when the correct data records are available. These problems limit the ability to locate the information for a particular entity within the database.
To reduce the amount of data that must be reviewed and prevent the user from picking the wrong data record, it is also desirable to identify and associate data records from the various information sources that may contain information about the same entity. There are conventional systems that locate duplicate data records within a database and delete those duplicate data records, but these systems only locate data records which are identical to each other. Thus, these conventional systems cannot determine if two data records, with for example slightly different last names, nevertheless contain information about the same entity. In addition, these conventional systems do not attempt to index data records from a plurality of different information sources, locate data records within the one or more information sources containing information about the same entity, and link those data records together.
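The limitation of identical-record matching described above can be illustrated with a minimal Python sketch. The record layout, the function names `exact_duplicates` and `similar_pairs`, the sample data, and the 0.8 threshold are all assumptions made for illustration; the character-level similarity from the standard `difflib` module stands in for whatever comparison a real system might use, and is not the method of any particular conventional system.

```python
from difflib import SequenceMatcher

# Hypothetical records from two information sources (illustrative data only).
records = [
    {"id": 1, "first": "Jim", "last": "Smith", "source": "hospital_a"},
    {"id": 2, "first": "Jim", "last": "Smith", "source": "hospital_a"},
    {"id": 3, "first": "Jim", "last": "Smyth", "source": "hospital_b"},
]


def exact_duplicates(recs):
    """Pair up records whose names are identical, as a conventional
    duplicate finder would."""
    seen = {}
    pairs = []
    for rec in recs:
        key = (rec["first"], rec["last"])
        if key in seen:
            pairs.append((seen[key], rec["id"]))
        else:
            seen[key] = rec["id"]
    return pairs


def similar_pairs(recs, threshold=0.8):
    """Pair up records whose full names are similar but not necessarily
    identical. The threshold is an assumed, untuned value."""
    pairs = []
    for i, a in enumerate(recs):
        for b in recs[i + 1:]:
            name_a = f'{a["first"]} {a["last"]}'.lower()
            name_b = f'{b["first"]} {b["last"]}'.lower()
            if SequenceMatcher(None, name_a, name_b).ratio() >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs


print(exact_duplicates(records))  # only the identical pair is found
print(similar_pairs(records))     # the "Smyth" record is also paired
```

In this sketch the identical-match pass pairs only records 1 and 2, while the similarity pass also pairs record 3 with the others. Whether such a pair truly refers to the same entity still has to be decided by other attributes, which is the harder problem the passage describes.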
These information sources may also impose hierarchical relationships among the various data records pertaining to different entities. These hierarchies may designate a variety of relationships between entities, such as social hierarchies (business organization, army chain of command, and church organization), containment hierarchies (biological taxonomy, geometric subsets, assemblies, bill of materials), genealogy hierarchies, or other parent-child data relationships. Thus, not only is it desirable to identify and associate data records from various data sources, but it may also be desirable to associate data records with data records in an existing or known hierarchy.
For example, a company may have multiple suppliers of parts where the suppliers may belong to a hierarchy of parent companies and there is a need to determine the level of business with a particular parent company on an ongoing basis. Multiple information sources may contain the different orders for parts from individual companies, while another third-party source (such as Dun & Bradstreet, Equifax, infoUSA, etc.) identifies the parent company hierarchy. It may be desirable to link part suppliers to the hierarchy to determine the amount of business with any particular parent company.
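As a rough illustration of the supplier example above, the following Python sketch rolls order amounts up to the top of a parent-company hierarchy. The company names, the `parent_of` mapping, the order amounts, and the function names are hypothetical, and the sketch assumes that supplier names in the orders match hierarchy entries exactly, which is precisely the assumption that fails in practice.

```python
from collections import defaultdict

# Hypothetical parent-company hierarchy, as a third-party source might
# supply it: child -> parent, with None marking the top of a tree.
parent_of = {
    "Acme Fasteners": "Acme Holdings",
    "Acme Holdings": "Globex Group",
    "Bolt Works": "Globex Group",
    "Globex Group": None,
}

# Hypothetical part orders gathered from several information sources.
orders = [
    ("Acme Fasteners", 1200.0),
    ("Bolt Works", 800.0),
    ("Acme Fasteners", 300.0),
    ("Unlisted Supply", 50.0),  # supplier absent from the hierarchy
]


def ultimate_parent(company, parent_of):
    """Walk up the hierarchy to the topmost known parent. A company not
    found in the hierarchy is treated as its own parent."""
    seen = set()
    while parent_of.get(company) is not None:
        if company in seen:
            raise ValueError(f"cycle in hierarchy at {company!r}")
        seen.add(company)
        company = parent_of[company]
    return company


def spend_by_parent(orders, parent_of):
    """Total the order amounts under each ultimate parent company."""
    totals = defaultdict(float)
    for supplier, amount in orders:
        totals[ultimate_parent(supplier, parent_of)] += amount
    return dict(totals)


print(spend_by_parent(orders, parent_of))
```

Here the two listed suppliers roll up to the same parent, while the unlisted supplier remains on its own because no hierarchy node could be found for it.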
In addition to the problems discussed above with respect to entity matching, matching data records to known hierarchies may present additional problems: parts of the hierarchy may be missing; a data record may match more than one node of a hierarchy tree; a data record may match nodes on two separate hierarchy trees; or a data record which is a node on one hierarchy tree may match a node on another hierarchy tree, in which case it may be necessary to reconcile the two hierarchy trees with one another.
Thus there is a need for a system and method for indexing information about entities/hierarchies from a plurality of different information sources which avoid these and other problems of known systems and methods, and it is to this end that the present invention is directed.