The present invention relates to database management systems, and particularly to a method for loading rows into a database table while enforcing constraints.
A well known database software program is DATABASE 2 (DB2) database software distributed by IBM Corporation. As is known in the art, DB2 operates as a subsystem in a computer system operating under the IBM MVS operating system software. In a DB2 environment, user data resides in DB2 tables which are in tablespaces. A tablespace is, for example, a portion of storage space in a direct access storage device (DASD) such as a disk drive. For exemplary purposes, illustrated below is an order_entry table that would be stored in a tablespace. The order_entry table contains columns: customer_number; product_code; order_number; buyer_name; and ship_to_zip.
While the above order_entry table shows four rows, the table could have millions of rows for all the orders of a company, for example 4 million rows. The order_entry table also has, for example, three index keys and two foreign keys. An index key is an identifier for a particular row of a table while a foreign key also identifies a row but is also used for referential integrity as described below. For example, in the order_entry table, one index key could be based on order_number, another index key based on buyer_name and a third index key based on ship_to_zip. As is known in the art, an index key for a particular table indicates a row identification (RID) and a selected value for the row (e.g., the index key value). The index key can be used to generate an index for the table which facilitates subsequent searches for particular data in the table. For example, the order_entry table would have three indexes (e.g., one for each index key), each index being stored in an indexspace. Similar to a tablespace, an indexspace is, for example, a designated portion of a DASD. Thus, if a user was looking for rows that contain a particular buyer name in the order_entry table, the database management system could query the buyer index for the table to identify all occurrences of the buyer name without reading the entire table to locate the rows.
As is known in the art, each table in a database may be either a parent table, a child table or both. A child table is related to a parent table via the foreign key value or values contained in columns of the child table. For example, a foreign key value can appear multiple times in a child table (e.g., multiple rows in a child table can have the same foreign key, such as the customer_number and product_code entries in the order_entry table) but each foreign key must be associated with a unique key in a parent table of the child table.
Referential integrity ensures that every foreign key value is valid (e.g., has a corresponding primary key in a parent table). Thus, referential integrity (RI) means that a value in the column of a row in the table is valid when this value also exists in an index of another table. A row should not be in a table if it violates a constraint. As the order_entry table illustrated above has two foreign keys, it has for example, a RI constraint on customer_number and product_code. As is known in the art, when a user of a DB2 database management system creates a table, the user also defines the constraints for the table (e.g., the user can define the relational integrity criteria). Illustrated below is a product table and a customer table (e.g., the parent tables for the foreign keys in the order_entry table).
The product table show five rows, although the table could have thousands of rows for all of the different products of a company. The product table has a unique index on the column product_code, which is illustrated in ascending order. The values in the column product_code are each unique since there is only one product code assigned to each product and thus in this table, a product code would not be included more than once. Accordingly, an index for the product table would include the key (e.g., the stored value in the product_code column) and a RID. The product table index would reside in a DB2 indexspace.
The customer table illustrated below shows four rows, although this table could also have thousands of rows for all of the customers of a company. The customer table has a unique index on the column customer_number, illustrated in ascending order. The values in the column customer_number are each unique since there is only one customer number assigned to each customer name and thus a customer number would not be included in this table more than once. Accordingly, an index for the customer table would include the key (e.g., the value of the column customer_number) and a RID. The customer index would also reside in a DB2 indexspace.
As shown by the above tables, all of the rows in the order_entry table are valid because the foreign key values in the column product_code exist in the index of the product table and the values in the column customer_number exist in the index of the customer table.
Conventional database management systems, such as DB2, provide the user with the ability to identify specific conditions that a row must meet before it can be added to a table. These conditions are referred to as xe2x80x9cconstraintsxe2x80x9d because they constrain the values that a row may include. Constraints include, for example, check constraints and referential integrity constraints. Check constraints include, for example, qualifying criteria for a particular value, such as a zip code value being in the range of 00000 to 99999.
When a user needs to, for example, populate the order_entry table with data, there are two options. One option is that a DB2 application program can be written by the user to load the data. For example, for many operations to be performed on a table of a DB2 database, such as loading data, an application program can be written (e.g., coded in software) to make use of DB2-provided services via, for example, structured query language (SQL) INSERT statements. Such applications, however, present an additional burden on users of the database system as the applications have to be written for the desired function to be performed.
An application program that is inserting, for example, 4 million records into a table will cause the DB2 system to perform around 4 million input/output (I/O) operations to enforce any required referential integrity constraints, which is in addition to the 4 million I/Os required to load the data into the table initially. Therefore, as each I/O takes approximately 25 ms, a significant amount of time can be spent loading a large number of records using an application program. In tuning performance of any computer program, including an application program, the variables involved in achieving optimal performance include instruction path length (e.g., the number of instructions), memory (e.g., address space) and number of I/Os. Therefore, a reduction in the number of I/Os to be performed can be accomplished, for example, by increasing the amount of memory utilized by the program.
An alternative to application programs for loading data is a load utility, which is generally faster than an application program. For example, a load utility will enforce any referential integrity constraints (in contrast to the application program requiring the DB2 system enforce constraints), thus providing a faster load of data. While constraints help ensure that the data in a database is consistent and clean, constraint enforcement normally requires an I/O to each parent index for each row. This is especially a problem when attempting to load a large number of rows into a database. The most common way of dealing with this problem has been to load the rows without checking the constraints and, after all of the rows are loaded, going back and checking the constraints and deleting the rows which violate the constraints. For example, conventional LOAD utilities for loading large amounts of records into a relational database conventional database management systems do not perform any referential integrity constraint checking prior to loading the new records into a table.
FIG. 1 illustrates a prior art method for loading new records into a database and performing referential integrity (RI) checks, but as described below, the RI checks are performed after the new data has been loaded into the database. As shown in FIG. 1, in step 1000, data to be loaded into a table of a database is provided in an input file and a conventional LOAD utility, such as LOAD utilities provided by IBM Corporation, reads the input file and if an end of file (EOF) identifier is not read in step 1010, each record in the input file is written into the tablespace in step 1020. After each row is loaded into the table in step 1020, then in step 1030 the SORT input records are built for 10 the index keys and foreign keys and written to a temporary input file for the SORT utility and are temporarily stored in step 1040. When the index keys and foreign keys are extracted and the SORT input records are built, the LOAD utility provides, for example, an identifier with each input record that can be subsequently read by the LOAD utility to identify what the particular SORT record represents. Step 1030 loops back to step 1000 to read additional records of the input file. When an EOF indicator is read in step 1010, step 1050 reads the input file of extracted index and foreign keys from step 1040 and sorts the index and foreign keys into, for example, ascending or descending order via a conventional SORT utility. As is apparent, when step 1010 identifies an EOF condition, the table has been loaded with the data from the input file but no referential integrity constraints have been applied to the loaded data.
The SORT utility in step 1050 collates the key values for each of the indexes and foreign keys in a conventional manner and writes the sorted values to a temporary sort output file where 30 the values are temporarily stored in step 1060. Once the SORT process is complete, the index key values stored in step 1060 are read and are written to the appropriate indexspace in step 1080 (e.g., using the order_entry table example, the indexes for order_number, buyer_name and ship_to_zip would be written to the appropriate indexspace DASD in step 1080). In step 1090, the foreign key values stored in step 1060 are read and compared against the appropriate parent index. As is known in the art, when the foreign key values are passed to the SORT utility in step 1050, the identifier provided when the key was extracted is also provided that identifies the parent index to be used in the enforce constraint phase of the LOAD operation.
If a match is determined between the foreign key value and the parent index in step 1100, the loaded data is valid and the load process is completed for the particular row in step 1110.
If, however, there is no match between the foreign key value and the parent index in step 1100, then in step 1120, the row is deleted from the table and also from the corresponding entries in the indexes for the table. The process of deleting the row from the table includes, for example, reading the page in the tablespace that contains the invalid row, flagging the invalid row as deleted and rewriting the page into the tablespace. Similar action must be taken for deleting each invalid row in each index of the table.
Thus, the prior art approach to loading large amounts of data into a database includes a program that first loads all of the data into the table, reads each row of the loaded table, extracts the index keys and foreign keys involved in the referential integrity constraint check, sorts the extracted key values, writes the indexes for the table, reads the sorted foreign keys and compares them against the parent index and then deletes any invalid rows from the table as well as the indexes.
According to the present invention, data to be loaded into a table of a database is read from an input file by a LOAD utility including a CHECK subutility that performs constraint checking on each row prior to loading the row into the table. For example, each row from an input file is read by the LOAD utility and passed to the CHECK subutility which applies any applicable constraints against the row and provides a code to the LOAD utility indicating whether the row should be loaded or discarded. If the row is to be loaded, the LOAD utility writes the row into the tablespace and passes the index keys to a SORT utility so that the indexes for the loaded table can be written into indexspace.
For example, during initialization, the CHECK subutility according to the present invention can, for example, read the DB2 catalog to identify any constraints that apply to the data to be loaded. If any check constraints apply, check constraint routines can be built and compiled for evaluation by the CHECK subutility. If any referential integrity constraints apply, the key values for the parent index of each referential integrity constraint can be read and the key values stored in an additional dataspace allocated by the CHECK subutility so that the foreign key values from each row to be loaded can be extracted by the CHECK subutility and compared against the parent index values prior to loading the data into the table. When the constraint checking for the loading of new data is completed, the dataspace allocated by the LOAD utility to facilitate pre-load constraint checking is released.
Thus, the method for loading data into a database table according to an embodiment of the present invention eliminates writing unchecked and potentially invalid rows into a table, sorting and writing index keys of unchecked and potentially invalid rows, sorting of foreign keys, the need to delete invalid rows from a loaded table and the need to delete invalid rows from indexes.