A common scenario in scientific computing is to run a complex computation across multiple heterogeneous environments, accessing and processing data stored in a variety of different formats. For example, one complex computation is to search for a DNA match of genetic material within various databases of genetic sequences and their citations. Heterogeneous environments contain different, and often incompatible, combinations of hardware, operating system, and application software. Typically, these different systems are purchased independently to serve a particular need, and enterprises may have information spread across multiple computer systems. Since system vendors attempt to provide a competitive edge over the offerings of their competitors, the different systems are almost by definition incompatible. Even relational database management systems (DBMS) based on the Structured Query Language (SQL) standards and the relational model can be incompatible due to differences in SQL implementations, database definition, and communication mechanisms. Hence, the task of combining heterogeneous systems for executing complex computations is difficult.
In executing complex computations over multiple, heterogeneous environments, a database management system is often used as the ultimate repository of the results of the computation, as well as the source of some of the data used in the computation. However, integrating yet another heterogeneous system, i.e., the database management system, into the environment presents additional complications to the already difficult task of executing the complex computation.
For example, one approach for managing such complex computations is known as multi-stage processing, in which the complex computation is broken up into multiple steps, each step submitted by the user and taking place in its entirety and independently on a homogeneous platform. Results of each computation are combined in a separate step and loaded into the results database. The coordination of the computation steps is done either manually or using some form of workflow engine, and the complete computational results are available only after being executed in batch mode, not interactively. Thus, in the multi-stage processing approach, integration with the database system is poor and non-interactive, essentially reducing the database system to an after-the-fact user interface to results of previously executed runs of data. Furthermore, coordination of the execution of the computations is difficult to achieve in a robust and scalable manner.
A more interactive approach is to code the specific coordination steps in a database query that can be submitted by a user. When the user submits the specifically coded query to the database system, the query is executed, causing the database system to initiate the various sub-components of the complex computation and combine the results together. Coding the query in this approach requires that the developer have an intimate knowledge of each heterogeneous system on which the parts of the computation are to be executed, and that the developer manage the execution of the parts outside the database. This approach is not general in scope and only addresses a specific computation.
Another approach is to leverage the parallel query, clustering, and two-phase commit mechanisms of some database systems that allow a single query to be executed across multiple (homogeneous) database instances. In this approach, the data present on an incompatible system is migrated into one or more of the databases, and the algorithms used to perform the computation are rewritten to function within the database. However, this migrate-and-rewrite approach is expensive and time-consuming, and it requires investment in hardware so that each node is capable of executing a database instance, in database management system software, and in database administration and application development resources. In addition, it may not be feasible to rewrite the algorithms to be executed in the database.
Therefore, there is a need for a robust, scalable, interactive, and inexpensive approach for performing complex computation across multiple heterogeneous database systems.