A data warehouse is a nonvolatile repository that houses an enormous amount of historical data rather than live or current data. The historical data can correspond to past transactional or operational information. Data warehousing and associated processing mechanisms (e.g., On-Line Analytical Processing (OLAP), Relational OLAP (ROLAP), Multidimensional OLAP (MOLAP), and Hybrid OLAP (HOLAP)) are widespread technologies employed to support business decisions and data analysis. Data warehouses are populated at regular intervals with data from one or more heterogeneous data sources, for example from multiple transactional or enterprise resource planning (ERP) systems. This aggregation of data provides a consolidated view of an organization from which valuable information can be derived. Though the sheer volume can be overwhelming, the organization of data can help ensure timely retrieval of useful information. For example, the organization of data in a data warehouse typically involves creation and employment of fact and dimension tables.
Standard data warehouse design methods require dimensions to include a surrogate key to join a dimension table with a fact table. A surrogate key is a deliberately meaningless, single part, unique identifier. There is no semantic value associated with the key rather its sole purpose is unique identification. The surrogate key is automatically generated by a database system upon inserting or loading a row in a dimension table, for example.
A conventional solution for generating these unique keys involves the use of auto-incrementing columns in the database or data warehouse. More specifically, the database or data warehouse designer defines an increment value for the column. When a new record is inserted into a table, the database stores a new value in the auto-incrementing column, which is greater than the previous row by the increment value and therefore guaranteed to be unique.
One of the problems of data warehouse development is that sometimes the surrogate key for a row needs to be known in advance in order to utilize that key in another operation. For example, if a document was received from a mainframe computer it might include information like a sales record and that sales record will typically include customer, product and transaction information. Furthermore, these rows can be very large including hundreds of columns. When this information is taken from a traditional system like a mainframe and loaded into a relational database system or some data warehouse system, typically the record will be spilt up into multiple tables such customer, product and order tables. Often a join needs to be executed on these tables together and the surrogate keys enable this to be done. However, in order to do this in a single process, the keys need to be known in advance, because the key values need to be inserted into the tables.
Many users try attempt to tackle this problem by inserting a key and immediately querying the table for the last key entered or the maximum value of the auto-incremented column. However, there are several problems with this approach.
First, the isolation of the database transaction may mean that the key value is not available until after the record is committed to the table. This requires the user to ensure that the current process is completed before they can use the new key value, which restricts their possibilities of using the new key to operations to be performed within the scope of the current transaction.
In addition, where many values are being inserted by parallel processes, it is not guaranteed that returning the last key generated will return the key for the expected record. For example, Process A may insert Record 1 and request the last key generated. However, between the insert and the request, Process B has inserted Record 2. Hence, the database returns the key for Record 2, which is not what Process A expects.