1. Field of the Invention
This invention relates generally to Relational Database Processing Systems, and in particular, to constraint checking and violation capture for bulk data stored in a relational database.
2. Description of the Related Art
A relational database management system (RDMS) uses relational techniques for storing and retrieving data. Relational databases are computerized information storage and retrieval systems in which data in the form of tables (xe2x80x9crelationsxe2x80x9d) are typically stored for use on disk drives or similar mass data stores. A xe2x80x9crelationxe2x80x9d includes a set of rows (xe2x80x9ctuplesxe2x80x9d or xe2x80x9crecordsxe2x80x9d) spanning one or more columns. A xe2x80x9crecordxe2x80x9d expresses a mathematical relation between its column elements. Reference is made to C. J. Date, An Introduction to Database Systems, vol. 1, 4th edition, Addison-Wesley Publishing Co. Reading Mass. (1986) for a description of a relational database management system.
An RDMS receives and executes commands to store, retrieve and delete data using high-level query languages such as the Structured Query Language (SQL). The term xe2x80x9cqueryxe2x80x9d means a set of commands for acting on data in a stored database. An SQL standard has been maintained by the International Standards Organization (ISO) since 1986. Reference is also made to the SQL-92 standard xe2x80x9cDatabase Language SQLxe2x80x9d published by the American National Standards Institute (ANSI) as ANSI X3.135-1992 and published by the ISO as ISO/IEC 9075:1992 for the official specification of the 1992 version of the Structured Query Language. See also James R. Groff et al. (LAN Times Guide to SQL, Osborne McGraw-Hill. Berkeley, Calif. 1994) for a description of SQL-92.
A table in an RDMS is partitioned into rows and columns such that there is one value at each intersection of a row and column. All of the values in a column are of the same data type. The only exception to this rule is that a value could be NULL. A NULL is a marker used to fill a place in a column where data is missing for some reason.
Tables are created explicitly by using the SQL CREATE TABLE command. A table may be created as xe2x80x9cpermanentxe2x80x9d, xe2x80x9ctemporaryxe2x80x9d, or xe2x80x9cvirtualxe2x80x9d. Permanent tables include the base tables that contain the fundamental data that is permanently stored in the database. Fundamental data refers to the data for which the database is created to manage in the first place for example, records of a group such as employees or students. Virtual tablesxe2x80x94also called xe2x80x9cviewsxe2x80x9dxe2x80x94are tables derived from base tables using queries. A view does not exist in the database as a stored set of values like a base table. Instead the rows and columns of data visible through the view are the query result produced by the query that defines the view. The definition of the view is stored in the database. Temporary tables are not permanently stored, but are used for handling intermediate results, Similar to program variables. Temporary tables are automatically flushed at the end of a working session. A table may be created, yet not have any data in it. Such a table, referred to as xe2x80x9cemptyxe2x80x9d, is typically created for receiving data at a later time.
xe2x80x9cConstraintsxe2x80x9d define conditions that data must meet to be entered into a permanent table of fundamental data. Constraints may apply to columns or to tables, they are checked by an RDMS. A constraint can be checked at any of the following times:
(1) after every statement that affects a table (e.g., after an INSERT query):
(2) at the end of a transaction executing one or more statements that affect a table: and
(3) at any time between 1 and 2.
Frequently, in large commercial database systems, data must be entered quickly and in bulk. Bulk-loading facilities, available for this purpose, load database tables at high speed from files outside an RDMS.
Because bulk-loading delivers massive amounts of data in a short amount of time, constraint checking can impose a severe bottleneck if not deferred until all of the data is loaded.
Even if deferred, constraint checking that must check each record one time for one constraint violation, flag the violation, and then check the same record again for each remaining constraint will consume a large amount of time, compounding the cost of bulk loading. Clearly, there is a need in the art for a utility that can check all constraints simultaneously for each record that requires checking in a given table to improve efficiency.
Recently, bulk loading tools have been provided which do not perform constraint checking. A table receiving bulk-loaded data is placed in a xe2x80x9cpendingxe2x80x9d state, meaning its data cannot he used until checked for constraints. What is needed is a tool for checking for constraints of such bulk-loaded data that can do so speedily and which also includes the capability of repairing such tables to remove violating records.
Referential Integrity
In any tool that performs constraint checking of bulk-loaded data the problem of ensuring that no constraints are violated is complicated by the need to ensure xe2x80x9creferential integrityxe2x80x9d at the database. Referential integrity ensures soundness of an entire database. Relatedly, consider the example of an employee database with a table that groups employees by department and a table that contains all possible departments of an employing organization. In this case, the table of employees would include a column representing the respective employees"" department numbers. The employee department number value is a xe2x80x9cforeign keyxe2x80x9d that references an unique identifying column in the table containing all the departments in an employing organization. The second table, in this case, is the xe2x80x9cparent tablexe2x80x9d. The unique identifying column in the parent table identifying department titles is referred to as a xe2x80x9cprimary keyxe2x80x9d. Referential integrity is the state when all foreign key values are present in their parent keys. If an employee""s department is eliminated and its corresponding record is deleted from the parent table, then the foreign key in the employee records representing that department, is invalid. In such a case, the system would lack referential integrity. Although the above simple example shows a foreign and primary key having only one column referential integrity can be assured using multi-column keys.
In the above example, the record for the employee having no department is said to be xe2x80x9corphanedxe2x80x9d because the foreign key has no parent table. A typical SQL technique for dealing with orphans is to eliminate them when their parent references are eliminated. A function known as CASCADE is available in SQL for ensuring that records having foreign keys are eliminated when their referenced primary keys are eliminated.
Table Check Constraints
Valid data within the table can be enforced through table check constraints. Table check constraints specify search conditions that are enforced for each row of a table.
A table check constraint can be used for validation. Examples of some constraints can be: the values of a department number must lie within the range 10 to 100; the job title of an employee can only be xe2x80x9cSalesxe2x80x9d, xe2x80x9cManagerxe2x80x9d, or xe2x80x9cClerkxe2x80x9d; or an employee who has been with the company for more than 8 years must earn more than $40,500.
What is needed is an efficient tool and method for checking constraints that includes referential integrity and table check constraints (for bulk-loaded data in a pending table that includes a violation capture mechanism).