Due to the increased amounts of data being stored and processed today, operational databases are constructed, categorized, and formatted for operational efficiency (e.g., throughput, processing speed, and storage capacity). Unfortunately, the raw data in these operational databases often exist as rows and columns of numbers and codes that appear bewildering and incomprehensible to business analysts and decision makers. Furthermore, the scope and vastness of the raw data stored in modern databases render them harder to analyze. Hence, applications have been developed to help interpret, analyze, and compile the data so that a business analyst may readily understand them. This is accomplished by mapping, sorting, and summarizing the raw data before they are presented for display. Individuals can then interpret the data and make key decisions based thereon.
Extracting raw data from one or more operational databases and transforming it into useful information is the function of data “warehouses” and data “marts.” In data warehouses and data marts, the data are structured to satisfy decision support roles. Before the data are loaded into the target data warehouse or data mart, the corresponding source data from an operational database are filtered to remove extraneous and erroneous records; cryptic and conflicting codes are resolved; raw data are translated into something more meaningful; and summary data that are useful for decision support, trend analysis or other end-user needs are pre-calculated.
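The filter-translate-summarize sequence described above can be illustrated with a minimal sketch. The record layout, code table, and field names below are hypothetical, chosen only to make the steps concrete:

```python
# Hypothetical ETL sketch: filter erroneous records, resolve cryptic
# codes, and pre-calculate a summary before loading a data warehouse/mart.
source_rows = [
    {"cust": "C1", "code": "A", "amount": 100.0},
    {"cust": "C1", "code": "A", "amount": 50.0},
    {"cust": None, "code": "A", "amount": 10.0},   # erroneous record
    {"cust": "C2", "code": "B", "amount": 75.0},
]
CODE_MEANINGS = {"A": "retail", "B": "wholesale"}  # resolve cryptic codes

# Filter out extraneous/erroneous records.
clean = [r for r in source_rows if r["cust"] is not None]

# Translate raw codes into something more meaningful.
for r in clean:
    r["channel"] = CODE_MEANINGS[r["code"]]

# Pre-calculate summary data (total amount per customer).
summary = {}
for r in clean:
    summary[r["cust"]] = summary.get(r["cust"], 0.0) + r["amount"]

print(summary)  # {'C1': 150.0, 'C2': 75.0}
```

In a production ETL tool these steps would typically run as configurable pipeline stages rather than inline code, but the ordering (filter, then translate, then summarize) matches the sequence described above.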
A data warehouse comprises an analytical database containing data useful for decision support. The warehouse contains data from multiple sources and in multiple formats. Data are extracted from the sources, transformed as needed, and mapped into the warehouse. A data mart is similar to a data warehouse, except that it contains a subset of corporate data for a single aspect of the business, such as finance, sales, inventory, or human resources. With data warehouses and data marts, useful information is kept at the disposal of the decision makers.
One major difficulty associated with implementing data warehouses and data marts is that of transporting data, non-invasively and in a timely manner, from the operational databases to the data warehouses and/or data marts. As new transactions occur, vast amounts of new data are generated and added to the operational databases. If the new data are not transported to the data warehouse/mart databases by the time of analysis, those databases are out of sync with the operational databases. Consequently, the data within the data warehouses/marts lose their relevance for the analyses used in support of the decision makers.
To maintain the relevance of the decision-making analyses, and to quickly capture the rich data patterns and information contained in the operational databases, frequent refreshes of the data warehouses/marts are preferred. However, operational databases are very large to begin with, and they can rapidly grow larger as new data are accumulated. As a result, data transport processes (data extraction, transformation, and loading) can consume a significant amount of system resources and take a long time to complete. Thus, it is desirable to find approaches for non-invasive data transport that can increase the throughput and speed of the data transporting process.
Prior Art FIG. 1 is a block diagram depicting the flow of data in accordance with a prior art data warehousing application. The model used to create a data mart capable of handling complex decision support queries is known as multi-dimensional modeling; one type of multi-dimensional model is called the "star schema." A star schema is characterized by two types of tables: fact tables and dimension tables. In a data warehousing application, a fact table is where numerical measurements of a business are stored, taken at the intersection of the dimensions from one or more dimension tables. For example, a fact table may include sales in dollars, number of units sold, total price, and the like. Dimension tables store the descriptions or characteristics of the business; that is, dimension tables contain the descriptors of the facts. For example, product, customer, region, or time could be used as dimensions. A dimension table usually contains a primary key, and a fact table contains a corresponding foreign key.
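A minimal star schema of the kind described above can be sketched as follows. The table and column names are illustrative only (they do not appear in the source); the essential relationship is the dimension table's primary key referenced by the fact table's foreign key:

```python
import sqlite3

# Illustrative star schema: one dimension table (customers) and one
# fact table (sales) joined on the dimension's primary key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (
    customer_id   INTEGER PRIMARY KEY,  -- primary key of the dimension
    customer_name TEXT
);
CREATE TABLE sales_fact (
    customer_id INTEGER,                -- foreign key into customer_dim
    units_sold  INTEGER,
    total_price REAL,
    FOREIGN KEY (customer_id) REFERENCES customer_dim (customer_id)
);
""")
conn.execute("INSERT INTO customer_dim VALUES (1, 'Acme')")
conn.execute("INSERT INTO sales_fact VALUES (1, 3, 29.97)")

# A decision-support query intersects the fact with its dimension.
row = conn.execute("""
    SELECT d.customer_name, f.units_sold, f.total_price
    FROM sales_fact f JOIN customer_dim d USING (customer_id)
""").fetchone()
print(row)  # ('Acme', 3, 29.97)
```

A real star schema would typically have several dimension tables (product, region, time, and so on) all radiating from the central fact table, which is what gives the model its name.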
With reference still to FIG. 1, for many applications, in particular electronic business ("e-business") applications, a slowly changing dimension table (first target database 18) and an aggregate fact table (second target database 28) are both populated with information from the same source (operational database 10). Currently, this is accomplished by first populating dimension table 18 with one target load order group (TLOG) (e.g., pipeline 1) and then populating fact table 28 with another TLOG (e.g., pipeline 2).
In pipeline 1, operational data are read from operational database 10, and the data are passed through aggregator 12, lookup 14 and filter 16 to identify and filter out duplicates of previously read data already stored in dimension table 18. Instances of data that are not duplicates are then added to dimension table 18. In pipeline 2, operational data are read from operational database 10 and passed through lookup 24 and aggregator 26 to transform the data into a format useful for decision support.
An example will be used to provide a further description of pipelines 1 and 2. From a previous execution of a pipeline, dimension table 18 will contain customer names and a unique identifier (a "customer ID") associated with each name. Before execution of pipeline 1, lookup cache 15a is built from a persisted cache file derived from dimension table 18. Thus, at the start of pipeline 1, lookup cache 15a contains the customer names and customer IDs already in dimension table 18 from the previous execution of the pipeline.
When new transactions (e.g., customer purchases) occur, new data will be added to operational database 10. In fact, a customer may make several purchases, and so the customer's name may appear several times in operational database 10. Pipelines 1 and 2 therefore need to be executed in order to update dimension table 18 and fact table 28, respectively, with the new data in operational database 10.
If the customer's name and associated customer ID are not already in dimension table 18, then it is necessary to assign a customer ID and add the name and the ID to dimension table 18. However, if the customer's name and ID are already in dimension table 18, then this is not necessary. In addition, it is not necessary or desirable to assign a customer ID to each of the multiple instances in which the customer's name appears in operational database 10.
Before updating dimension table 18, aggregator 12 functions to identify and combine duplicate instances of customer names in operational database 10. Lookup 14 compares the output from aggregator 12 against lookup cache 15a to identify new customer names. If the customer name appears in lookup cache 15a, then it is not a new name; in this case, filter 16 filters out the name so that it is not added to dimension table 18 (i.e., filter 16 filters out the rows in which the lookup value was found). If the name does not appear in lookup cache 15a, then it is a new name; in this case, the name is forwarded to dimension table 18. Sequence generator 17 then functions to automatically generate a customer ID (e.g., a primary key) for each new customer name.
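The behavior of pipeline 1 (aggregator 12, lookup 14, filter 16, and sequence generator 17) can be sketched as follows. The sample names, cache contents, and ID-assignment rule are hypothetical; the source does not specify how the sequence generator seeds its counter:

```python
import itertools

# Hypothetical sketch of pipeline 1: combine duplicate customer names
# (aggregator 12), drop names already cached from dimension table 18
# (lookup 14 / filter 16), and assign IDs to new names (sequence
# generator 17).
operational_rows = ["Alice", "Bob", "Alice", "Carol", "Bob"]
lookup_cache_15a = {"Alice": 1}          # built from dimension table 18
dimension_table_18 = {"Alice": 1}

# Assumed seeding rule: continue from the highest existing ID.
next_id = itertools.count(start=max(lookup_cache_15a.values()) + 1)

distinct_names = set(operational_rows)                 # aggregator 12
new_names = [n for n in distinct_names
             if n not in lookup_cache_15a]             # lookup 14 + filter 16

for name in sorted(new_names):                         # sequence generator 17
    dimension_table_18[name] = next(next_id)

print(dimension_table_18)  # {'Alice': 1, 'Bob': 2, 'Carol': 3}
```

Note that lookup_cache_15a is not updated here, which is exactly the staleness problem the prior art runs into: after this pass, the cache no longer reflects dimension table 18.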
To populate fact table 28 in pipeline 2, operational database 10 is read again. To get the customer IDs needed for fact table 28, a new lookup cache 15b is built by reading from dimension table 18 before pipeline 2 executes. Lookup cache 15a cannot be reused because, after it was built, dimension table 18 was updated with new customer names and IDs. Thus, dimension table 18 contains more recent information that is not contained in lookup cache 15a. Lookup cache 15b is built after execution of pipeline 1 but before the execution of pipeline 2, and thus will contain the information added to dimension table 18 in pipeline 1.
Lookup 24 reads the customer IDs from lookup cache 15b, and aggregator 26 calculates the data for fact table 28. For example, aggregator 26 may calculate the total sales per customer. In this case, fact table 28 would contain the customer IDs and the total sales associated with each customer ID.
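Pipeline 2 (cache rebuild, lookup 24, and aggregator 26) can be sketched in the same hypothetical terms, continuing the example above:

```python
# Hypothetical sketch of pipeline 2: rebuild the lookup cache (15b) from
# the now-updated dimension table, map names to customer IDs (lookup 24),
# and aggregate total sales per customer ID (aggregator 26).
dimension_table_18 = {"Alice": 1, "Bob": 2, "Carol": 3}  # after pipeline 1
operational_rows = [("Alice", 10.0), ("Bob", 5.0), ("Alice", 7.5)]

# Cache 15a is stale, so a fresh cache 15b is built from the dimension
# table -- the costly rebuild step the prior art cannot avoid.
lookup_cache_15b = dict(dimension_table_18)

fact_table_28 = {}
for name, sale in operational_rows:       # operational DB read a second time
    cid = lookup_cache_15b[name]          # lookup 24: name -> customer ID
    fact_table_28[cid] = fact_table_28.get(cid, 0.0) + sale  # aggregator 26

print(fact_table_28)  # {1: 17.5, 2: 5.0}
```

The sketch makes the two prior-art costs visible: the operational rows are scanned a second time, and the lookup cache is rebuilt in full even though only a few entries changed.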
The prior art is problematic because, as described above, after the execution of pipeline 1 the dimension table 18 will contain more recent information (e.g., new customer names and customer IDs) than that contained in lookup cache 15a. As a result, it is necessary to re-initialize lookup cache 15a (that is, build lookup cache 15b) before populating fact table 28 (before executing pipeline 2). That is, any update to the dimension table requires that the lookup cache be rebuilt. Caches 15a and 15b can be very large (often on the order of two gigabytes each), and so rebuilding the lookup cache can consume valuable processing resources (e.g., computer resources such as processor cycles and memory), and can also decrease the speed at which data are transported and processed, thereby decreasing data throughput.
Another problem with the prior art is that operational database 10 is read twice, first to populate dimension table 18 in pipeline 1, and then to populate fact table 28 in pipeline 2. As described above, operational database 10 is very large (often on the order of 25 gigabytes), and so it can take several hours to read. Reading operational database 10 more than once also consumes valuable processing resources and decreases the speed at which data are transported and processed, decreasing data throughput. Additionally, it can negatively impact the throughput of the transactions operating against the operational database.
Potential solutions for addressing these problems are complicated by the use of different processes in different portions of the pipeline. For example, in the PowerMart Suite by Informatica of Palo Alto, Calif., a Data Transformation Manager (DTM) manages one portion of pipeline 1, and a Writer process is launched for another portion of pipeline 1, as illustrated in FIG. 1.
Accordingly, what is needed is a method and/or apparatus that can increase the speed at which data are transported and processed, and reduce the load on processing resources. What is also needed is a method and/or apparatus that can satisfy the above needs and that can be adapted for use in a data warehouse system. The present invention provides a method and