Vast amounts of information are contained within structured data sources, such as relational databases, XML documents, flat files, and other storage mechanisms. Generally, a user must understand the schema, or underlying structure and organization, to effectively query these data sources. For example, to effectively query a relational database, a user must know the name of the database, the names of all tables addressed by the query, and the names and data types of all attributes associated with the query. Likewise, when no XML schema is available, a user must extract the structure, attributes, and tags from an XML document to effectively query it.
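The dependence on schema knowledge can be illustrated with a minimal Python sketch using the standard-library sqlite3 module. The table name employee and the column names name, hire_year, and start_year are hypothetical and chosen only for illustration. A query written with the correct names succeeds, while a query that guesses a column name is rejected by the database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding one hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (name TEXT, hire_year INTEGER)")
    conn.executemany(
        "INSERT INTO employee (name, hire_year) VALUES (?, ?)",
        [("Ada", 2001), ("Grace", 1998)],
    )
    return conn


def names_hired_before(conn, year):
    """Query that relies on knowing the table name, the column names,
    and that hire_year holds integers."""
    rows = conn.execute(
        "SELECT name FROM employee WHERE hire_year < ? ORDER BY name",
        (year,),
    )
    return [row[0] for row in rows]


def query_with_guessed_column(conn):
    """Return the error message raised when a column name is guessed wrongly."""
    try:
        # start_year is not a column of employee (hypothetical wrong guess).
        conn.execute("SELECT name FROM employee WHERE start_year < 2000")
    except sqlite3.OperationalError as exc:
        return str(exc)
    return None
```

Without the schema, a user has no reliable way to tell that the year column is called hire_year rather than start_year, which is the difficulty the paragraph above describes.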
Though necessary, schemas are not sufficient for formulating meaningful queries. Users must also understand the meanings of data elements to extract useful information from the structured data. This presents a serious problem when dealing with multiple disparate data sources because naming conventions may vary significantly across the sources. Names may consist of terms or abbreviations specific to businesses or organizations, or may merely be arbitrary identifiers incomprehensible to outsiders. In addition, identical names may carry different meanings for different users. For example, the name “bureau” may mean drastically different things to a government contractor and a furniture supplier.
Security is yet another problem hampering access to structured data. In particular, database schemas may reveal sensitive information that an organization is unwilling to release. For businesses and organizations, databases and data repositories are critical resources that are tightly interconnected with other parts of their infrastructure. Even when some data could be made available to a wider audience and yield commercial or other benefits, allowing access to the data may pose substantial security risks, and such access is therefore seldom granted.
Due to the aforementioned problems, viable options in the field have been limited to two principal approaches. The first is the “federated systems” approach, wherein several databases are integrated into one virtual database and their schemas are combined into a global schema. Queries are formulated against the global schema and are programmatically translated into queries to specific databases. The approach presumes knowledge of all related database schemas to build a program for translating the queries. In addition, the federated systems approach scales poorly and is practical only for a relatively small number of databases. Adding a new database to a federated system generally requires manually updating the global schema and the translation program to incorporate the changes, both of which are costly and time-consuming endeavors.
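The translation step of a federated system can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a description of any particular federated product: the global table person, the source names hr and crm, and all local table and column names are hypothetical, and each source is simulated with an in-memory sqlite3 database.

```python
import sqlite3

# Hypothetical global schema: one logical table "person" with the fields
# "name" and "city". Each entry maps those global names to the local
# table and column names of one source database.
MAPPINGS = {
    "hr": {"table": "employee", "name": "emp_name", "city": "office_city"},
    "crm": {"table": "customer", "name": "full_name", "city": "town"},
}


def make_source(table, name_col, city_col, rows):
    """Create one in-memory source database with its own local naming."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} ({name_col} TEXT, {city_col} TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return conn


def translate(source):
    """Translate the global query 'names of persons in a given city'
    into SQL for one specific source, using that source's mapping."""
    m = MAPPINGS[source]
    return f"SELECT {m['name']} FROM {m['table']} WHERE {m['city']} = ?"


def federated_names_in_city(sources, city):
    """Run the translated query against every source and merge the results."""
    results = []
    for source, conn in sources.items():
        results.extend(row[0] for row in conn.execute(translate(source), (city,)))
    return sorted(results)


def build_demo_sources():
    """Build two hypothetical sources whose schemas differ."""
    return {
        "hr": make_source("employee", "emp_name", "office_city",
                          [("Ada", "Paris"), ("Grace", "Oslo")]),
        "crm": make_source("customer", "full_name", "town",
                           [("Linus", "Paris")]),
    }
```

In this sketch, adding a third source requires a new hand-written entry in MAPPINGS that exposes the source's local table and column names, which mirrors the maintenance cost and schema-disclosure concerns described above.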
The second approach exploits “agents,” or computer programs using heuristics or artificial intelligence, for translating user queries into queries to physical databases. Agents, however, are similarly hampered in heterogeneous environments by the scalability and security issues discussed above, which limit their adoption for commercial and other practical purposes. Thus, what is needed is a scalable and secure system and method for querying heterogeneous data sources that seamlessly integrates those sources.