In many industry applications, a large amount of data is generated, collected, and deposited in a database. For example, in the oil and gas industry, geological formation data may be collected during the oil exploration stage. This geological formation data is stored and used for determining the locations where drilling should take place. Further, during the oil production stage, data regarding oil and gas flow rates may be collected to monitor oil production processes. The aforementioned data may then be used to optimize the oil production processes. Regardless of the data's source and/or use, a large investment is often needed to build a database to store the aforementioned data. The databases are typically built with the goal that that all the data is stored and managed such that subsequent search and use of the data is optimal.
The market of local Database Management Systems (DBMS) is dominated by client-server products called SQL Servers, developed by Sybase and Microsoft. SQL Servers are based on the Structural Query Language (SQL). SQL Servers are popular in low-end applications in small businesses and local offices where the local DBMS run on stand-alone desktops and laptops. This is because SQL Servers are of low cost, simple administration, and good performance, as a result of being based on the popular Windows® NT technology (Windows NT is a registered trademark of the Microsoft Corporation, USA). On the other hand, Oracle® (Oracle is a registered trademark of Oracle Corporation, USA) dominates the high-end database market (such as corporate and national repositories) because of its high scalability, reliability, and a wide assortment of features. Oracle® and SQL Servers have many differences, and it is often difficult to support applications on both Oracle and SQL Server. Software vendors (and hence users) are often forced to choose one platform over the other. Migrating data from one platform to another and merging data may require extensive effort. Further, a customized DBMS is often required to run on different platforms.
In the following context, a database instance and a DBMS are collectively referred to as a “data repository”. A database instance can have many data stores. A data store contains a collection of tables and views. A table is an array of data. A view is a particular way of looking at a table, and the change of a view does not affect the physical organization of a table. Users are typically not granted access to the tables directly. Rather, they are granted access to views of the tables. A view in a data store provides a standard interface to that data store. Tables and views can be private or public. Conventionally, for a database entity XYZ, a private implementation of a table is represented with XYZ_, and a public interface view is represented with XYZ. The data in a data repository are usually related among themselves, and such a relation is usually defined by a logical specification in the form of a “data model.”
Basic operations involving a database include: (1) search, which allows a user to specify search conditions such that the retrieved set of data matches the search conditions; (2) insertion, by which a new set of data with given attributes are inserted into a table; (3) updating, which changes the attributes of a selected data set from a table into new values; and (4) deletion, which delete a data set from a table.
A major part of data management is loading new data into the data repository. The new data may correspond to updated versions of existing data in the database or the new data may correspond to data that is not current stored in the data base. In the form case, the new data is used to populate an existing entity in the database. In contrast, in the later case, a new entity is created in the database and the new data is then used to populate the new entity. It is important to set up rules for deciding whether the incoming data matches existing entities (i.e., is associated with an existing entity). One of the purposes for the aforementioned results is to ensure that the new data have mandatory values and that referential integrity is satisfactory.
In the case of an violation of the rules, the DBMS should have options to completely reject the new data, or to allow partial loading of a consistent subset of the new data. If a violation causes a complete abortion of a transaction, it is potentially dangerous because the database is in an inconsistent state unless all the previously rejected data can be reloaded in a single subsequent transaction. On the other hand, if referential integrity is enforced while loading data, the new data must be provided to the database in the right order. In some cases, the new data includes cyclic references and, thus, there is no way to keep insert the new data in the right order without modifying the manner in which the new data is entered. In such cases, the manner in which new data must be entered into the database is to break the cycles.
As discussed above, there are cases in which the DMBS detects an error in the new data, DMBS typically have a mechanism to detect and report errors. One common mechanism to detect and report errors is to detect errors one at a time as they occur. If there are many errors in the new data, then the aforementioned mechanism makes it is difficult to discover and fix systematic errors. Further, the aforementioned mechanism forces the user t to decide whether to repeatedly fix newly discovered errors and try again or to reject the new data in its entirety.