Data about an entity, such as a subject, a company, an idea or the like, may be stored in a plurality of disparate data sources. To assemble the data about the entity into a single data store, it is necessary to gather the data from each of the disparate data sources and then determine how to combine the gathered data for the particular entity.
In the healthcare industry, information/data about each healthcare provider, such as a doctor, a therapist, a nurse, a hospital, a medical practice and the like, may be stored in a plurality of disparate data sources. The information/data about the healthcare provider may include, for example, reviews, directions, rates and the like. The disparate data sources for the information/data about the healthcare provider may range from the publicly available Centers for Medicare and Medicaid Services' (CMS) National Plan and Provider Enumeration System (NPPES) data to privately curated and licensed data from the American Medical Association (AMA), among others.
The issues that must be confronted in order to successfully integrate the data from these various data sources into a single data store may include:
    - While the provider documents are structured, the available data fields are heterogeneous across data sources.
    - There is no strong identifier linking provider documents across data sources. Even a provider's name may be suspect for a number of reasons:
        - Names may be legally changed
        - Informal variations (e.g., nicknames)
        - Misspellings due to human error
        - Inconsistent transliteration from non-Roman alphabets
        - Multiple providers with the same name
    - No single data source can be trusted as authoritative, as there is no central mechanism in place to update each concerned organization as provider information changes over time.
    - As there is no central mechanism for updating provider information, the data available from NPPES, AMA and others invariably become out of sync, even among the commonly available data fields.
    - While the data from NPPES, AMA and others provide a top-level view of their own provider directories, they too have combined data from potentially thousands of lower-level sources, and errors may have propagated through their own systems.
    - There are more than one million individual healthcare providers in the United States, and manual curation and inspection of all providers' data is not feasible.
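As an illustration only of why a provider's name is a weak identifier, the following minimal sketch compares name strings in Python. The function names `normalize` and `name_similarity` and the sample names are hypothetical and are not taken from any of the data sources named above. The character-level similarity from the standard library's `difflib` is an assumption chosen for brevity, not a method described in this document.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lower-case a name, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Hypothetical records for illustration; not real provider data.
# A misspelling defeats exact matching but still scores as highly similar.
exact_match = normalize("Katherine Smith") == normalize("Kathrine Smith")
fuzzy_score = name_similarity("Katherine Smith", "Kathrine Smith")

# Two different providers who share a name are indistinguishable by name alone.
same_name_score = name_similarity("John Smith", "John Smith")
```

In this sketch the misspelled pair fails an exact comparison yet scores close to 1, while two distinct providers with the same name score exactly 1. Neither exact nor approximate name comparison is therefore sufficient by itself to link provider documents across data sources.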
Thus, it is desirable to provide a system and method for dynamic data identification and combining so that, for example, data from disparate data sources for a healthcare provider may be combined into a single data store.