The present application relates to data warehouses, and more particularly to techniques for transforming and loading information into a data warehouse in an efficient manner.
ETL (data Extraction, Transformation, and Loading) processes are commonly used to extract data from data sources, perform transformations on the extracted data, and to load the transformed data into one or more target data stores. Various ETL tools are provided for performing ETL processes including ETL tools from Informatica. The data sources may be databases (such as Oracle databases), applications, flat files, and the like. The target data store may include databases, flat files, and the like. In a typical ETL process, desired data is identified from the data sources and extracted from one or more data sources. The extracted data is then transformed and then the transformed data is loaded into the target data store. Data transforms are often the most complex and, in terms of processing time, the most costly part of an ETL process. The transformations can range from simple data conversions to extremely complex transformations.
ETL processes are commonly used where the data sources are online transaction processing (OLTP) systems and the target data store is a data warehouse. For example, the target data store may be a sales analysis data warehouse and a data source might be an order entry system that records all of the current order activities. A data warehouse is a data store that is designed for query and analysis rather than for transaction processing. A data warehouse usually stores historical data derived from transaction data, it can also include data from other sources. Data warehouses are designed to accommodate ad hoc queries and data analysis and enable separation of analysis workload from transaction workload. A data warehouse enables an organization to consolidate data from various sources.
Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query and analytical performance. A schema is a collection of database objects, including tables, views, indexes, and synonyms. A star schema is a simple and commonly used schema for storing data in a data warehouse. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. The center of the star comprises one or more fact tables and the points of the star are dimension tables associated with the fact tables. A typical fact table comprises fact measures and a set of keys (also referred to as foreign keys). The keys enable the fact table to be joined with its associated dimension tables for queries. Accordingly, the dimension tables associated with a fact table are related to the fact table through the foreign keys included in the fact table. For example, a fact table may be a SALES table having associated dimension tables TIME, PRODUCT, REGION, SALESPERSON, etc. The SALES fact table may store measures quantity_sold, amount, and cost, and store foreign keys time_ID, product_ID, region_ID, and salesperson_ID that enable the fact table to be joined with the dimension tables.
ETL processes may be used to extract, transform, and load data into a data warehouse based upon a star schema. As part of the process, the facts data may be extracted from one or more data sources and stored in a temporary fact-staging table. The records in the temporary fact-staging table may then be transformed and the transformed records may then be loaded into the fact table (the “final” fact table) in the data warehouse.
As part of the transformation process, foreign keys (or dimension table unique identifiers) for the dimension tables associated with a fact table that enable the fact table to be joined with the dimension tables need to be added to each of the records stored in the fact-staging table. This is typically done using a lookup transformation (such as the lookup transformation provided by Informatica and other ETL vendors) which is used to lookup values from a relational table/view or a flat file. During workflow execution, the lookup transformation creates a lookup cache (usually file-based) and loads into the cache all the records from a dimension table associated with the fact table. The lookup transformation then takes the records from the fact-staging table, matches them to data in the lookup cache, and determines foreign keys (unique identifiers) from the dimension table that are to be added to the fact-staging table records. Once this has been done for all the dimension tables associated with the fact tables, the fact-staging table records, now including foreign keys corresponding to the different dimension tables, are loaded into the final fact table in the data warehouse.
Due to the use of lookup transformations, in effect, each record from the fact-staging table has to pass through the lookup transformations for all the dimension tables prior to the records being loaded into the final fact table. Since there could be millions of fact records in the fact staging table, the use of the lookup transformation significantly impacts the ETL process. While the time needed for performing the transformations using the lookup transformations may be acceptable for narrow fact tables (i.e., fact tables associated with a small number of dimension tables, for example, 10 to 12 dimension tables), processing is adversely impacted for wider fact tables (i.e., fact tables associated with a large number of dimensions, for example, over 50 or 60 dimension tables). This is because each record from the fact-staging table is routed through a large number of lookup transformations corresponding to the large number of dimension tables. Further, since the lookup transformation does file system based searching of unique identifiers in the dimension tables, the process does not scale with a large number of dimensions. For example, transforming and loading wide fact tables can have a throughput as low as 266 RPS (rows per second). Loading one such fact table with 100 million records can take over 4 days to process.
The design and implementation of an ETL process is an important part of any Business Intelligence (BI) solution. The performance of an ETL process is one of the key factors that determine the success or failure of a BI implementation. Techniques are thus desired that improve the processing time needed for transforming and loading data into a data warehouse.