Data relevant to a given query may be stored across many different types of databases, for example, triple stores, relational databases (SQL stores), or cloud databases (e.g., Hadoop, Cloudbase, HBase). However, searching across multiple types of large-scale, heterogeneous databases poses a variety of technical and scientific challenges.
For example, in traditional extract, transform, load (“ETL”) approaches, the data in each database is duplicated and converted to a common model, which poses a significant challenge at the petabyte scale. Additionally, synchronization issues between the source data and its duplicate may arise at larger scales and under BASE (basically available, soft state, eventually consistent) semantics.
Differences in data formats, granularities, schemas, and distributions pose the biggest challenge to data integration. Data sources almost always differ in their structural models and representations, and may also differ in their coverage, granularity, perspective, and terminology. To complicate matters further, different communities may use the same schema in different ways (semiotic heterogeneity). Additionally, in traditional ETL approaches, if data sources do not align properly, any impedance mismatch between two data models is baked into the transformed data.
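As a minimal illustrative sketch of how a mismatch can become baked into transformed data, consider two sources that differ in granularity. The source names, field names, and values below are hypothetical and are not drawn from any particular system; the common model is assumed to be a monthly table.

```python
from collections import defaultdict

# Hypothetical source A: one record per day.
daily_source = [
    {"site": "S1", "date": "2020-01-01", "count": 3},
    {"site": "S1", "date": "2020-01-02", "count": 5},
    {"site": "S1", "date": "2020-02-01", "count": 2},
]

# Hypothetical source B: one record per month.
monthly_source = [
    {"site": "S2", "month": "2020-01", "count": 40},
]


def etl_to_common_model(daily, monthly):
    """Copy both sources into one monthly table (the assumed common model)."""
    totals = defaultdict(int)
    for row in daily:
        # Truncate "YYYY-MM-DD" to "YYYY-MM"; the day is discarded here.
        totals[(row["site"], row["date"][:7])] += row["count"]
    for row in monthly:
        totals[(row["site"], row["month"])] += row["count"]
    return dict(totals)


warehouse = etl_to_common_model(daily_source, monthly_source)
```

In this sketch, the transformed copy holds only monthly totals, so a later query for a single day cannot be answered from it even though source A originally held that detail.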
These challenges are only magnified at scale. Traditional ETL approaches to data integration and fusion fail for cloud-scale data. The sheer scale of the data makes it impractical to convert and redundantly store it for the purpose of querying.
In view of these deficiencies in the prior art, there exists a need for a software middleware component that mediates between multiple data models and allows queries to be performed against large-scale, heterogeneous databases.