This invention relates generally to a relational database management system (RDBMS). More specifically, the present invention relates to a system and method for the efficient bulk transfer of data into a relational database where in the input data is already in a computer's main memory.
A RDBMS is a computer database management system that uses relational techniques for storing and retrieving data. Relational databases are computerized information storage and retrieval systems in which data in the form of tables (formally, “relations”) are typically stored for use on disk drives or similar mass data stores. A “relation” includes a set of rows (formally, “tuples” or “records”) spanning several columns (formally, “attributes”). A “tuple” expresses a mathematical relation between its “attributes” (column elements), while a “record” does not. Reference is made to C. J. Date, An Introduction to Database Systems, 6th Ed., Addision-Wesley, Reading, M. A. (1994) for general treatment of the relational database art.
A well known database software program is DATABASE 2 (DB2) database software distributed by IBM Corporation. As is known in the art, DB2 operates as a subsystem in a computer system operating under the IBM MVS operating system software. In a DB2 environment, user data resides in DB2 tables with rows and columns, as described above. DB2 can manage multiple concurrent requests to manipulate the same table at the same time by tracking SQL read and write requests, and acquiring and releasing resource locks.
A RDBMS is structured to accept commands to store, retrieve and delete data using high-level query languages such as the structured query language, known as SQL. Standard SQL statements can be used to accomplish extensive database management operations across various platforms of currently available commercial database management systems. The SQL standard has been promulgated by the International Standards Association since 1986. Reference is made, for example, to the SQL-92 Standard “Database Language SQL” published by the ANSI as ANSI X3.135-1992 and published by the ISO as ISO/IEC 9075:1992 for the official specification of the 1992 version of the Structured Query Language. SQL is the widely accepted language and interface to access relational databases, namely to define the database objects, to manipulate the data in the database objects, and to control access to the database objects. For example, these SQL statements or commands may be utilized to perform database management tasks such as the retrieval of data from a database and updating data in a database.
For example, SQL “query” statements refers to a set of user commands for retrieving data from a stored database. SQL is used to communicate queries to an RDBMS. SQL requires the return of a particular result set in response to a particular query, but the method of query execution (“Query Execution Plan”) employed by the RDBMS is not specified by the query. There are typically many different useful execution plans for any particular query, each of which returns the required result set. For large databases, the execution plan executed by the RDBMS to execute a query must provide the required data return at a reasonable cost and time and hardware resource. Almost all RDBMSs include a query optimizer to translate queries into an efficiently executable plan. Queries are submitted by users of the RDBMS.
Also, the database management task of loading data into a relational database, such as for adding and/or updating records, is of critical importance. There are two known methods of loading data into a relational database.
First, it has been well known in the prior art to command an application program to invoke an SQL INSERT statement which transfers one row of application data into a database table. The data is first transferred into variables or into memory structures before being loaded, i.e. “inserted” into the table. The SQL INSERT statements incurs overhead for the RDBMS because it must first log the changes made to the tables and then set up and communicate the SQL INSERT call to the relational database management system. Also, further overhead is incurred because the SQL INSERT command requires the RDBMS to manage concurrency by acquiring and releasing locks on the tablespace. The overhead from SQL INSERT statements can be considerable. For example, an application program that is inserting millions of records into a table can cause the DB2 system to perform corresponding millions of input/output I/Os required to load the data into the table initially. This requires a proportional number of log and space updates. In addition, referential integrity checks, indexing and logging of all of the changes incurs even further overhead. Therefore, each I/O can take a significant amount of time to load a large number of records using an application program. In tuning performance of any computer program, including an application program, the variables involved in achieving optimal performance include instruction path length (e.g., the number of instructions), memory (e.g., address space) and number of I/Os.
Secondly, it has also been known in the prior art to use a load utility to manage and control the loading of data into a relational database. This alternative method allows a program (herein termed the “load utility”) to take input data and insert it directly into the underlying structures that represent the tables within the relational database. This method can allow the invoker of the load utility to turn off logging and also to issue a replace operation of an entire table without having to first perform the deletion of the existing data. Such an operation with SQL INSERT calls can require a series of SQL DELETE calls prior to loading the new data. However, a single mass DELETE call is commonly used which suffers from logging and locking escalations which are proportional to the number of I/O operations.
Such use of a load utility requires the application to write data to a storage media that is subsequently used as input to the load utility. External media, such as an intermediate file, hard disk or tape, must be accessible to provide the input to the load utility. For the purposes of this discussion, any storage medium or media other than the computer's main memory is considered herein to be “external media”. As can be understood, this requires additional computer resources in the form the additional disk space, tape volumes or other type of external computer media. The use of a load utility further suffers from disadvantage of slowing down the bulk data transfer process because additional time is required for the application to write data to the intermediate file on the external media and for the load utility to subsequently read it.
In the prior art, query monitoring software is available that can collect statistics on each SQL statement executed for a given DB2 subsystem that is being monitored. These statistics are initially stored in several dataspaces, which are main memory areas that can only contain data and not programs. At the end of a timed interval or at the user's request, the dataspaces are transferred to DB2 tables. For long intervals or frequently accessed tables, this could result in the addition of several million rows to the database tables. Using SQL INSERT to add the statistics to the table could frequently require more elapsed time, CPU time or resource locks than allowed by the environment.
In view of the foregoing, there is a demand for a more efficient method to load large quantities of data into a database that is already contained in a computer's main memory. There is also a demand to reduce the elapsed time and CPU time used when adding bulk data to relational database tables. There also is a need to minimize the number of resources used when adding bulk data to relational database tables. There is a further demand to eliminate the need for additional input files on external media when the data is already contained in memory.