The present invention relates generally to databases and data warehouses, and more particularly to an elegant system and method for integrating and constructing a data warehouse or other data storage area.
In recent decades, as the computational and storage capacity of computer systems has improved, business record keeping has for the most part become automated. Thus, businesses have come to rely heavily on large-scale databases of sales, billing, payroll and other data, typically stored on magnetic storage media.
From the perspective of the information technology professional, such databases can conventionally be organized and maintained using variety of Database Management Systems (xe2x80x9cDBMSxe2x80x9d). Among such database systems, those adhering to a xe2x80x9crelationalxe2x80x9d model, Relational Database Management Systems (xe2x80x9cRDBMSxe2x80x9d), are likely the most popular.
A relational database is a collection of data that is organized in related two-dimensional tables of columns and rows. Data in a table can be accessed and manipulated by performing set operations on the tables, such as join, sort, merge, and so on. These operations are typically initiated by way of a user-defined query that is constructed in a query language such as Structured Query Language (SQL). SQL queries consist of high level commands which typically describe the data set to be selected, retrieved or processed.
While RDBMS databases clearly represent an advance over earlier manual record keeping, from an enterprise wide perspective, such databases are often not organized or designed with a view toward optimal efficiency. In particular, an enterprise will typically maintain and support several databases and record file stores, each respectively serving the particular needs of sales, billing and marketing departments, for example, or even of particular individuals. The information in these databases will generally overlap somewhat, resulting in duplication of effort to one degree or another. In addition, this decentralized, piecemeal collection of databases may render many applications, particularly applications involving enterprise-wide data relationships, difficult and cost ineffective to implement.
In an effort to address this problem, more and more businesses have turned to the creation of xe2x80x9cdata warehouses.xe2x80x9d These typically large-scale databases provide a storehouse for data, generally accumulated over years, and generated from a variety of database locations and applications. In the best case, such data warehouses are organized and maintained using a DBMS flexible enough to implement all applications of current interest, as well as to anticipate future applications.
Unfortunately, while data warehouses should result in efficiency savings over the long run, the design and creation of such databases is in general a time consuming, costly project. For example, because data must be integrated from a variety of database locations and applications, multiple data conversions must take place (e.g., of text, spreadsheet or tabular data). Each such conversion might require its own application software.
In addition, even where all source data is in a relational tabular (RDBMS) format, the selection of whole tables of data, perhaps consisting of millions of rows, for export to a data warehouse can be very time consuming and taxing of computational resources. In particular, while query languages allow the specification of data selection-using constructs similar to natural English, they do not allow developers to specify the actual procedure used to accomplish data selection. Thus, the procedure which actually implements the data selection may not be suitable for all applications (e.g. data warehouse creation). For example, an SQL selection of a stored table for export to a data warehouse would first result in the creation of the entire table in memory, possibly stretching the system""s computational resources to the limit.
In addition, depending on the desired characteristics of the data warehouse, source data bound for the data warehouse may need to first undergo transformation (e.g. division, combination, creation of new and complex mathematical relationships). Such transformation adds yet another step in the design process, and depending on its complexity, may require the application of programming languages requiring specialized knowledge (e.g., COBOL, C++).
Moreover, once data bound for a data warehouse is selected and optionally transformed, it must be integrated and stored with other data, all conformed to new access constructs consistent across the data warehouse. For example, if two tables, each with a customer Social Security number field, are bound for a data warehouse, one may use an xe2x80x9cSSNumxe2x80x9d naming convention, while another might use a xe2x80x9cSocial Securityxe2x80x9d naming convention to refer to the same data type. Ensuring such consistency may be time consuming, as it may involve: 1) selecting the first table using SQL; 2) changing field names; 3) storing the first table; 4) selecting the second table using SQL, and so on.
In short, what is needed is a cost effective, time efficient system and method for integrating and constructing a data warehouse or other data storage area in accordance with desired design characteristics.
Briefly, the present invention is directed toward technology that provides users with powerful tools necessary to manage and exploit data. In particular, the present invention is directed to a single xe2x80x9cdata pumpxe2x80x9d application environment, which allows developers to specify the import (selection), transformation and export of data to a desired destination location, all on a streaming, contiguous basis. As buffering allows the synchronization of the import and export steps, source data can be fetched, processed and exported a subset at a time, without the creation of whole sets of source data in memory. Thus, computational resources are spared.
The data pump processes may preferably be initiated by execution of instructions in pre-stored task modules. For example, execution of such instructions may cause selection (importation) of specified source data. In addition, the present invention preferably utilizes data interface drivers for each respective type of source data, thus allowing conversion to a relational (tabular) format. Therefore, selection instructions in task modules may consist of query language commands, such as SQL commands.
The data pump process may also execute instructions in pre-stored task modules to accomplish transformations of data in accordance with whatever degree of complexity is desired. Preferably, the task modules incorporate conventional language scripts, allowing specification of complex data transformations in conventional programming languages with a minimum of technical knowledge.
Thus, the present invention combines the ease of use of conventional, high-level query commands, with the control capability of more procedural programming languages.
Moreover, the present invention preferably incorporates easy to use user interfaces. These allow, for example, the intuitive specification of destination location (for export), as well as the correspondence of source and destination access constructs (e.g. of source and destination tabular column names). Thus, the present invention allows the specification of consistent access constructs across destination site data, when this characteristic is desired.
In addition, the present invention may also incorporate the creation of data lineage identifiers which track the task modules described above, or specific versions of such task modules. Thus, the lineage identifier can provide details about data pump transformations undergone by the data. For example, the lineage identifier can act as a pointer to detailed history files of operations that were performed on data and transformed it into its current form.
Preferably, a lineage identifier is added to each row in a table. The lineage identifier can then be used to trace the source of the data, i.e., data having a common identifier share a common history.
Further benefits and advantages of he present invention will be made more apparent in the more detailed description below.