1. Field of the Invention
This invention relates to an architecture for providing transparent access to heterogeneous XML data over a wide area network such as the Internet.
2. Description of Related Art
A vast amount of the information currently accessible over the Web, and in corporate networks, is stored in databases. For example, in a service provider company like AT&T, there are a large number of useful databases containing information about customers, services (e.g., ordering, provisioning, and billing databases), and the infrastructure used to support these services. Similarly, there are a huge number of publicly accessible datasets over the Web on a variety of subjects, such as genetics, which provide information about nucleotide sequences, gene expression data, reference sequences, etc.; and finance, which provide historical and current information about the stock market, periodic statistics about the economy and the labor markets, etc. To retrieve information from these databases, users typically have to: (i) identify databases that are relevant to their task at hand, and (ii) separately issue queries against each of these databases. This process can be quite onerous, especially since there are many different data representations and query mechanisms: legacy databases (for example, IBM's IMS) and specialized formats (e.g., ISO's ASN.1) abound. The task of identifying relevant databases is made more challenging by the dynamic nature of this collection of databases: new databases appear often, database schemas evolve, and databases (just like Web sites) disappear. In this environment, the goal of providing a mechanism to issue declarative, ad hoc queries against this dynamic collection of heterogeneous databases, and receive timely answers, remains elusive.
The recent trend of publishing and exchanging a wide variety of data as XML begins to alleviate the issue of heterogeneity of databases/datasets accessible over the Web and in corporate networks. One can represent the contents of relational databases and legacy IMS databases as XML, just as datasets represented in the ASN.1 format can be transformed into XML.
XML query mechanisms, such as XPath and XQuery, can then be used to pose queries uniformly against these databases, without the user having to know the specific data representation and query mechanism used natively by these databases. Several approaches proposed in the literature provide partial solutions to the problems of identifying relevant databases and locating desired query answers. These solutions include:
Traditional data integration technology seeks to provide a single integrated view over a collection of (typically domain-specific) databases. Applications of this technology include systems for querying multiple flight databases, movie databases, etc. The principal advantages of this approach include the ability to pose declarative, ad hoc queries without needing to know the specific backend databases, and being able to receive timely answers. However, considerable manual effort is involved in the task of schema integration, and it is not obvious how this technology can be used in a flexible way over a dynamic collection of heterogeneous databases.
Conventional Web search engine technology over static documents is based on spidering the documents, building a single site index over these documents, and answering keyword queries for document location using this index. For this technology to be potentially applicable to our problem, the contents of these databases would need to be published as static Web documents, which is rarely feasible. The principal advantage of this approach is the ability to pose (simple, keyword) queries against a dynamic collection of heterogeneous databases, without explicit schema integration. However, timeliness issues (search engine indices tend to be out of date), the lack of query expressiveness, and the infeasibility of publishing database contents as static Web documents make this approach less than ideal for querying frequently updated databases.
Recent peer-to-peer and data grid technologies have proven useful in locating replicas of files (e.g., music files) and datasets (e.g., in collaborative science and engineering applications), specified by name, using centralized and distributed metadata catalogs. The principal advantage of applying these technologies to our problem is the ability to locate and access information in a timely fashion. However, it is not clear how these technologies could be used to provide declarative query access over a dynamic set of databases.
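By way of illustration only, the uniform XPath access over XML-exported data discussed earlier in this section may be sketched as follows. The sketch assumes a hypothetical customer table (the table name, column names, and values are invented for the example) and uses the Python standard library's ElementTree module, which supports only a subset of XPath. It is not a description of any particular database's export mechanism.

```python
import xml.etree.ElementTree as ET

# Hypothetical relational rows; table and column names are illustrative only.
CUSTOMER_ROWS = [
    {"id": "1", "name": "Ada", "service": "billing"},
    {"id": "2", "name": "Bob", "service": "provisioning"},
    {"id": "3", "name": "Cy", "service": "billing"},
]


def rows_to_xml(table_name, rows):
    """Export relational rows as XML: one <row> element per row, one child per column."""
    root = ET.Element(table_name)
    for row in rows:
        record = ET.SubElement(root, "row")
        for column, value in row.items():
            ET.SubElement(record, column).text = value
    return root


def names_for_service(root, service):
    """Pose an XPath query: names of rows whose <service> child equals `service`."""
    # The [tag='text'] predicate is within ElementTree's supported XPath subset.
    return [e.text for e in root.findall(f"./row[service='{service}']/name")]


if __name__ == "__main__":
    customers = rows_to_xml("customers", CUSTOMER_ROWS)
    print(names_for_service(customers, "billing"))
```

The query is written against the XML view alone; the caller does not need to know that the data originated in a relational table.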
The present invention provides a solution to the problem of issuing declarative, ad hoc queries against such a dynamic collection of XML databases, and receiving timely answers. Specifically, the present invention provides decentralized architectures, under the open and the agreement cooperation models between a set of sites, for processing XPath queries and updates to XML data. The architectures of the present invention model each site as consisting of XML data nodes (which export their data as XML, and also pose XPath queries) and one XML router node (which manages the query and update interactions between sites). The architectures differ in the degree of knowledge individual router nodes have about data nodes containing specific XML data. The system and method of the present invention further develop the internal organization and the routing protocols for the XML router nodes that enable scalable XPath query and update processing in these decentralized architectures. Since router nodes tend to be memory constrained, and the routing states maintained at the router nodes are storage intensive, the present invention facilitates a space/performance tradeoff by permitting aggregated routing states, and by providing algorithms for generating and using such aggregated information at router nodes. The scalability of the architectures and the performance of the query and update protocols are compared experimentally, using a detailed simulation model and varying key design parameters.
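The space/performance tradeoff of aggregated routing states may be illustrated by the following minimal sketch. It is an assumption-laden toy, not the routing protocol of the present invention: the class name RouterNode, the methods advertise and route, the depth-based truncation used for aggregation, and the restriction to simple child-axis paths are all hypothetical choices made for the example.

```python
from collections import defaultdict


def aggregate(path, depth):
    """Truncate a simple path such as '/a/b/c' to its first `depth` steps."""
    steps = [s for s in path.split("/") if s]
    return "/" + "/".join(steps[:depth])


class RouterNode:
    """Hypothetical XML router: maps (possibly aggregated) paths to data node names."""

    def __init__(self, depth=None):
        # depth=None keeps full paths; an integer aggregates to that many steps.
        self.depth = depth
        self.routes = defaultdict(set)

    def _key(self, path):
        return path if self.depth is None else aggregate(path, self.depth)

    def advertise(self, data_node, path):
        """Record that `data_node` exports XML data under `path`."""
        self.routes[self._key(path)].add(data_node)

    def route(self, query_path):
        """Return the data nodes that may hold answers to a simple path query."""
        return sorted(self.routes.get(self._key(query_path), set()))


if __name__ == "__main__":
    exact, coarse = RouterNode(), RouterNode(depth=1)
    for router in (exact, coarse):
        router.advertise("db1", "/customers/billing")
        router.advertise("db2", "/customers/orders")
    print(len(exact.routes), exact.route("/customers/billing"))
    print(len(coarse.routes), coarse.route("/customers/billing"))
```

In this sketch the aggregated router stores one routing entry instead of two, at the cost of forwarding the query to a data node (db2) that holds no matching data. Less routing state is thus exchanged for extra query traffic.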