The present application relates to techniques for extracting and storing hierarchy data, and more particularly to techniques for extracting hierarchy data from a data store, flattening the hierarchy data, and storing the flattened hierarchy data in a target data store.
ETL (data Extraction, Transformation, and Loading) processes are commonly used to extract data from data sources, perform transformations on the extracted data, and to load the transformed data into one or more target data stores. Various ETL tools may be provided as part of an ETL tier that is configured to perform the extraction, transformation, and loading operations. Examples of ETL tools include Informatica Power Center, Oracle Data Integrator, and IBM WebSphere Datastage, and others. The data sources from which data is extracted may be databases (such as Oracle databases), applications, flat files, and the like. The target data store where the data is stored may include databases, flat files, and the like. In a typical ETL process, desired data is identified and extracted from one or more data sources. The extracted data is then transformed and the transformed data is then loaded into the target data store. Data transforms are often the most complex and, in terms of processing time, the most costly part of an ETL process. The transformations can range from simple data conversions to extremely complex transformations.
ETL processes are commonly used where the data sources are online transaction processing (OLTP) systems and the target data store is a data warehouse. For example, the target data store may be a sales analysis data warehouse and a data source might be an order entry system that records all of the current order activities. A data warehouse is a data store that is designed for query and analysis rather than for transaction processing. A data warehouse usually stores historical data derived from transaction data, it can also include data from other sources. Data warehouses are designed to accommodate ad hoc queries and data analysis and enable separation of analysis workload from transaction workload. A data warehouse enables an organization to consolidate data from various sources.
A common problem faced by ETL designers is how to convert hierarchy data stored in the form of parent-child relationships in a relationship table spanning several rows in a data source (such as an OLTP system) and transform that to a single row that can be stored in the target data warehouse, where the single row identifies a hierarchical path in the hierarchy. ETL designers have to design tools to “flatten” hierarchy data extracted from a data source and store the flattened information into a data warehouse (this problem is sometimes also referred to as how does one perform row-column transpose). This is a very common problem that is faced by ETL designers in various environments in which hierarchies are used such as in applications such as financial applications (e.g., use of value set hierarchies in Oracle applications) that may store cost-center hierarchy data, human resources management applications that may use management reporting hierarchy information, inventory applications that use product hierarchies data, and the like.
Traditionally, flattening of hierarchies is done by the ETL tier using a row-by-row processing approach. This approach is favored by ETL developers (e.g., developers using ETL tools provided by Informatica) due to its ease of implementation. In the row-by-row approach, such as the one used by Informatica, the ETL tier is configured to extract the multiple rows storing parent-child relationships for a hierarchy one at a time. Informatica expressions “remember” the values of the fields in a “previous” parent-child record. Using the principle of induction, a record is created by the ETL tier in an Informatica expression to store values remembered from the previous records by stacking the records sideways. A record set, containing information about many hierarchies, sorted by hierarchy identifier, is fed to such an expression. As each parent-child relationship record within a hierarchy passes through the expression, a hierarchy emerges incrementally at the output layer of the ETL tier. Markers are used to distinguish different hierarchies within a record set containing many hierarchies and filters are used to discard partially constructed hierarchies. In this way, programmers use this design pattern to build a hierarchy to be stored in a data warehouse by accessing the parent-child relationship information that is spread across many records in the data source.
The traditional row-by-row processing solution discussed above has several shortcomings. The row-by-row processing creates a bottleneck in the ETL tier and slows down the ETL operations. The processing also requires substantial processing resources in the ETL tier. Further, this approach cannot be supported in an ELT architecture (as opposed to an ETL architecture), where the transformation phase is performed in the data warehouse or in a staging warehouse, since there is no way to access the values in the field values in the previous record in a sequential record set. In the ELT world, necessity dictates that an alternative method be designed to flatten hierarchies, within the constraints of database features.
Techniques are thus desired that improve the processing time needed for flattening hierarchies for loading into a data warehouse.