1. Field of Invention
The present invention relates generally to the field of database archival. More specifically, the present invention is related to an archiving method for extract, transform, and load tasks.
2. Discussion of Prior Art
As more and more daily business decisions and analyses require the use of large volumes of complex warehouse data, the time required for building warehouse data has grown substantially. End users of a data-warehousing environment often need to complete a series of extract, transform, and load (ETL) tasks in one or more job streams within a limited time window on some periodic basis. An ETL task is defined by the manipulation of data: the extraction of data from a source, the insertion into a warehouse, or the modification of existing warehouse data. A job stream describes a set of related ETL tasks chained in a dependency tree structure. Each prerequisite task in a job stream must be completed successfully before subsequent dependent tasks can be started. If a prerequisite ETL task fails, certain actions must be immediately taken to ensure the continuation of the complete task flow, thus ensuring the execution of all subsequent, dependent ETL tasks. The continuation of task flow is necessary to ensure final warehouse data can be accurately built in a timely fashion. Capturing and managing these ETL task activities becomes vital to the delivery of current warehouse data for timely business decision-making. Being able to meet the service level commitment in building and refreshing warehouse data is crucial to the success of a business.
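By way of illustration only, the dependency behavior of a job stream described above may be sketched as follows. The sketch is not taken from any particular warehouse product; the function name run_job_stream, the status labels, and the task names are hypothetical, and the ordering relies on the standard Python graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_job_stream(prerequisites, run_task):
    """Run ETL tasks in dependency order (illustrative sketch).

    prerequisites maps each task name to the set of tasks it depends on.
    run_task(name) is assumed to return True on success, False on failure.
    Returns a dict mapping each task to 'completed', 'failed', or 'blocked'.
    """
    status = {}
    # static_order() yields every task after all of its prerequisites.
    for task in TopologicalSorter(prerequisites).static_order():
        deps = prerequisites.get(task, ())
        if any(status[d] != "completed" for d in deps):
            # A prerequisite failed or was itself blocked, so this
            # dependent task is never started.
            status[task] = "blocked"
        else:
            status[task] = "completed" if run_task(task) else "failed"
    return status


# Hypothetical three-task job stream in which the transform step fails.
job_stream = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
result = run_job_stream(job_stream, lambda name: name != "transform")
```

In this example the failed transform task leaves the load task blocked, which is the condition that calls for prompt corrective action so that the remaining task flow can continue.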
Data warehouse end users need a relatively large warehouse data availability time window for complex data analysis. Therefore, warehouse administrators and operators must monitor ETL tasks performed by end users closely to ensure that these tasks are completed successfully within a designated time window. Timely and successful completion of these tasks ensures that warehouse end users can access current warehouse data with low latency. To accomplish this, exception conditions must be corrected promptly so that a scheduled ETL task flow can resume without consuming a substantial amount of end users' data analysis time.
As the volume of complex business data continues to grow on a daily basis and data warehouse end users continue to demand larger availability windows to access warehouse data, the need arises for a monitoring system that can provide an efficient method of problem determination and future auditing with regard to ETL tasks. To accomplish efficient problem determination and to provide an audit trail, the history of ETL task execution statuses must be preserved. However, the status of ETL tasks in a typical data-warehousing environment may not be persistent since the execution status of an ETL task changes to indicate ETL task progress; only the final execution status of a completed ETL task is stored in operational warehouse metadata. Thus, it is necessary to store operational metadata that is frequently retrieved and updated, especially during the time of ETL task execution. When an ETL task terminates abnormally, it is necessary to have a record of all interim execution statuses for a given task prior to the failure for the purposes of problem determination and for future auditing. However, preservation of all interim execution statuses in operational warehouse metadata can potentially impact the performance of ETL tasks due to increased metadata volume. In addition, preserving all interim execution statuses creates an increase in the administrative load required to maintain and control warehouse metadata. To reduce latency for end users in a data-warehousing environment, normalized warehouse metadata must not be queried excessively while the data warehouse is online. Querying for ETL execution status, for example, reduces performance and poses the risk of misinterpreting or corrupting changing operational metadata.
In one approach, archived warehouse metadata is used to capture and store the changes in operational warehouse metadata. This approach is limited in that archived warehouse metadata has the potential to grow without bound, thus furthering maintenance concerns. If archived warehouse metadata is “pruned” by data warehouse administrators as it grows, no audit trail records will remain for future reference. In addition, a trigger mechanism that captures changes in operational warehouse metadata and continuously propagates them to archived warehouse metadata may run at the same time that a data warehouse end user is performing data analysis on the same archived warehouse metadata. Such simultaneous access to, or operation on, archived warehouse metadata can influence or even corrupt analyses.
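For illustration only, a trigger-based capture of the kind described in this approach may be sketched with an in-memory SQLite database. The table names task_status and status_archive, the trigger names, and the task name load_sales are hypothetical and are not drawn from any cited system; the sketch merely shows that the operational table retains only the latest status while the archive gains one row per change.

```python
import sqlite3


def build_demo():
    """Create operational and archive tables linked by triggers (sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Operational metadata: one row per task, overwritten on each change.
        CREATE TABLE task_status (task TEXT PRIMARY KEY, status TEXT NOT NULL);
        -- Archived metadata: one row appended per captured change.
        CREATE TABLE status_archive (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            task TEXT NOT NULL,
            status TEXT NOT NULL
        );
        CREATE TRIGGER archive_insert AFTER INSERT ON task_status
        BEGIN
            INSERT INTO status_archive (task, status)
            VALUES (NEW.task, NEW.status);
        END;
        CREATE TRIGGER archive_update AFTER UPDATE OF status ON task_status
        BEGIN
            INSERT INTO status_archive (task, status)
            VALUES (NEW.task, NEW.status);
        END;
    """)
    return conn


conn = build_demo()
conn.execute("INSERT INTO task_status VALUES ('load_sales', 'scheduled')")
for new_status in ("running", "failed"):
    conn.execute(
        "UPDATE task_status SET status = ? WHERE task = 'load_sales'",
        (new_status,),
    )

# Operational metadata holds only the most recent status.
current = conn.execute("SELECT status FROM task_status").fetchall()
# The archive holds every interim status, in the order captured.
history = [row[0] for row in
           conn.execute("SELECT status FROM status_archive ORDER BY id")]
```

Because every status change appends a row, the archive table in such an arrangement grows with each task execution unless it is pruned, which is the maintenance concern noted above.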
Whatever the precise merits, features, and advantages of the above-cited references, none of them achieves or fulfills the purposes of the present invention.
Thus, there is a need in the art for a dynamic method of data capture and storage of past and current statuses of ETL tasks that does not impact the performance of the existing data warehousing environment and access operations. Provision must be made for data warehouse administrators to control the time, frequency, and granularity at which changed operational warehouse metadata that is captured for analysis is stored and refreshed, and to easily recover damaged archived warehouse metadata. Also necessary in the art are methods for capturing ETL exception conditions and for generating error recoveries for handling exception conditions.