The invention relates generally to computer database systems and more particularly to processing referential integrity constraint violations resulting in cascading deletes during database load operations. The subject matter of the invention is generally related to the following jointly owned and co-pending patent application: “Constraint Processing” by Christopher Y. Blaicher, Kerry C. Tenberg and Randol K. Bright (Ser. No. 10/871,160) which is incorporated herein by reference in its entirety.
Virtually all modern DataBase Management Systems (“DBMS”) provide mechanisms that permit users to constrain the value of one database entity based on the value or existence of another database entity. One common constraint type is the referential constraint. Referential constraints require that a value referred to by one database entity is associated with an existing entity in the database. In the context of the Structured Query Language (“SQL”), referential constraints are implemented through the use of Foreign Keys (“FK”), wherein a database entity's FK value must equate to the Primary Key (“PK”) value of another, existing, database entity.
In general, constraint processing is preformed during database update and load operations and may be handled in accordance with one of three policies. In the first, deletion of a referenced entity is prohibited. This policy is often referred to as the “Reject Violating Modifications” policy. In the second, if a referenced entity is deleted or determined to be invalid then all entities that reference it are also deleted (or marked invalid). This policy is often referred to as the “Cascading” policy. In the third, FK values referencing a deleted or invalid PK value are set to NULL. This policy is often referred to as the “Set-Null” policy.
Consider, for example, a relational database entity (a table) such as that shown in Table 1. Under the Reject Violating Modifications policy, the entry associated with Employee-1 (row 1) could not be deleted because at least one other entry's FK references the entry. In the example of Table 1, in fact, two other entries (rows 2 and 3) are referentially constrained to the entry associated with Employee-1. Under the Cascading policy, if the entry associated with Employee-1 (row 1) was deleted or determined to be invalid during a load operation, those entries in rows 2 and 3 would become invalid because their FK would no longer refer to an existing/valid entry. Once the entry in row 2 is invalid, the entries in rows 4 and 5 become invalid because they too would have FK values associated with a non-existing or invalid PK value. Similarly, those entries in rows 77 and 97 become invalid. Thus, under the second policy above, only row 6 remains valid after deletion of the entry in row 1. Under the Set-Null policy, all illustrated entries except that of row 6 would have their FK value set to NULL.
TABLE 1Illustrative Database Table EntityEmployee IDManager IDOtherRow(PK)(FK)Attributes111Data221Data331Data442Data552Data623 47 Data    77 37 5Data    97 47 5Data    
The situation in which one row (through failure to load or becoming invalid or deleted) causes a chain of one or more additional rows to become invalid is known as the “cascading delete” problem. It will be recognized by those of ordinary skill that the problem of cascading deletes only occur in self-referencing tables (see FIG. 2A) or when two or more tables are related through referential constraints to form a cycle (see FIG. 2B).
As illustrated by Table 1, both the Cascading and Set-Null policies require that when an entry is marked for deletion, or as invalid during a load operation, the effect of this action must be checked against all other entries in the affected entities (e.g., table or tables). It is for this reason that referential constraint checking can consume a larger amount of time, particularly during database load operations and especially for large and/or heavily constrained entities.
In the context of a relational database system and referential constraint processing in accordance with a Cascading policy, FIG. 1 shows prior art load-time referential integrity check operation 100 proceeds generally as follows. A first row of data is obtained (block 105) and its data is checked for validity. If all of the row's data is valid (the “Yes” prong of block 110), the data row is loaded into a table (block 115). If all of the data is not valid (the “No” prong of block 110), the row is not loaded into the table (block 120). In a modern DBMS such as DB2, a row rejected as invalid at this stage of the load process is also identified in a system log entry. Following the acts of blocks 115 and 120, if additional data remains to be checked/verified (the “Yes” prong of block 125), processing continues at block 110. If no additional data remains (the “No” prong of block 125), referential integrity check processing begins (block 130-155).
During referential constraint processing a first valid row is obtained from the loaded table (block 130) and a check is made to determine if its FK corresponds to a PK in a valid entry. If the entry's FK does not correspond to a valid PK entry, meaning the entry's referential constraint is violated (the “Yes” prong of block 135), the row is marked for deletion (block 140) and the entire table is checked to see if this change further invalidates additional row entries (block 145). Typically, as each new invalid row is identified (the “No” prong of block 150), it is marked for deletion (block 140). As illustrated in FIG. 1, the acts of blocks 140 through 150 are recursive in nature such that each entry (data row) marked as invalid or for deletion causes the entire table (or an index thereon) to be searched. Once all entries in the table have been checked (the “Yes” prong of block 150), a check is made to determine if additional constraints need to be checked. If additional constraints need to be verified (the “Yes” prong of block 155), processing resumes at block 135 where the particular constraint is checked. If all referential constraints have been checked (the “No” prong of block 155), a check is made to determine if additional data rows in the table remain to be checked. If additional data rows remain to be checked (the “Yes” prong of block 160), processing continues at block 130. If all valid rows have been checked (the “No” prong of block 160), the table is updated to remove all rows designated for deletion and made available to users of the DBMS (block 165).
One drawback to prior art techniques such as process 100 is the time required to process cascading deletes. As illustrated by blocks 140 and 145, each time a row is determined to be invalid (marked for deletion), the table is checked to determine what effect this has on other rows. Accordingly, the total time to load a table T(total) is the time required to physically load the data T(data) plus the time required to check a referential integrity constraint T(ref) multiplied by a function that depends upon the number of referential constraints (N), processed f(N):T(total)=T(data)+[T(ref)×ƒ(N)].  EQ. 1In an operational DB2 database system, it has been found that the function f(N) is typically a low-order polynomial. For example, if the value of N is doubled, the value of f(N) may increase by a factor of 1.5 to 1.8.
It is clear from FIG. 1 that in situations where one invalid row causes additional rows to become invalid through failure to satisfy a referential constraint (creating “cascading deletes”), the prior art makes multiple passes through the table data (see discussion above regarding blocks 140-150)—requiring a processing time given by the [T(ref)×ƒ(N)] term in EQ. 1. As the time to process even a single referential constraint in this manner can significantly increase a table's load time (especially for large tables and/or tables exhibiting highly “nested” or “layered” referential constraint relationships), it would be beneficial to provide methods and devices to efficiently load database objects in the face of cascading deletes.