Technical Field
This disclosure relates to the integration of large data sets, and more specifically to a unified approach that makes data from dissimilar resources accessible.
Related Art
As the availability of data continues to grow, automatic access to different data sets becomes increasingly challenging. Electronic data may be stored in distributed resources with different schemas, formats, and structures. Before data mining may process distributed data, the systems must resolve representation conflicts, naming conflicts, format conflicts, and similar inconsistencies. A representation conflict may involve objects that are identified through different attributes. For example, a field identified as email in a first schema may be identified as an address in a second schema. Naming conflicts may arise when records refer to the same underlying entities in multiple ways or when the same name identifies different information. Format conflicts may arise when different formats or abbreviations are used to identify the same underlying entities.
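The three kinds of conflict described above may be illustrated with a minimal Python sketch. It is not part of the disclosed system; the records, the field mapping (FIELD_MAP), the abbreviation table (STATE_ABBREVIATIONS), and the function names are all hypothetical, and treating a shared email value as evidence that two records describe the same entity is a simplifying assumption made only for this example.

```python
# Hypothetical records describing the same person in two different sources.
record_a = {"name": "Robert Smith", "email": "rsmith@example.com", "state": "Tennessee"}
record_b = {"name": "Smith, Bob", "address": "rsmith@example.com", "state": "TN"}

# Assumed mapping from second-schema field names to first-schema field names.
FIELD_MAP = {"address": "email"}

# Assumed abbreviation table used to compare differently formatted values.
STATE_ABBREVIATIONS = {"TN": "Tennessee"}


def rename_fields(record, field_map):
    """Return a copy of the record with mapped field names replaced."""
    return {field_map.get(key, key): value for key, value in record.items()}


def expand(value):
    """Expand a known abbreviation; leave any other value unchanged."""
    return STATE_ABBREVIATIONS.get(value, value)


def detect_conflicts(a, b):
    """List the conflicts between two records as (kind, first, second) tuples."""
    conflicts = []

    # Representation conflict: the same attribute has different field names.
    for b_field, a_field in FIELD_MAP.items():
        if b_field in b and a_field in a:
            conflicts.append(("representation", a_field, b_field))

    b_renamed = rename_fields(b, FIELD_MAP)

    # Format conflict: values differ as written but agree once expanded.
    for field in sorted(a.keys() & b_renamed.keys()):
        value_a, value_b = a[field], b_renamed[field]
        if value_a != value_b and expand(value_a) == expand(value_b):
            conflicts.append(("format", value_a, value_b))

    # Naming conflict: the same entity (same email, by assumption) has two names.
    if a.get("email") == b_renamed.get("email") and a.get("name") != b_renamed.get("name"):
        conflicts.append(("naming", a["name"], b_renamed["name"]))

    return conflicts


conflicts = detect_conflicts(record_a, record_b)
for kind, first, second in conflicts:
    print(f"{kind} conflict: {first!r} vs {second!r}")
```

In this sketch each conflict is detected from hand-written lookup tables. In practice such mappings would generally have to be inferred or curated, which is part of what makes the integration problem difficult.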
Some processes attempt to resolve these integration problems sequentially and independently, allowing errors to pass uncorrected from one integration step to the next. Other processes attempt to consolidate the data from all of the different sources into a single repository, which requires extensive processing, scaling, and searching, as well as large memories.