Data is often migrated from one location to another for data warehousing (DW). At present, data migrations are generally performed by Extract, Transform, Load (ETL) processes. ETL is a process in database usage and warehousing that involves extracting data from outside sources, transforming the data to fit operational needs, and loading the data into the target database or warehouse. Many data warehousing processes are capable of consolidating data from different sources. Common data source formats include relational databases and flat files, but may also include non-relational database structures such as an Information Management System (IMS), Virtual Storage Access Method (VSAM), Indexed Sequential Access Method (ISAM), etc.
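By way of illustration only, the three ETL stages described above may be sketched as follows. This is a minimal sketch under stated assumptions: the flat-file contents, the column names, the `sales` table, and the helper names `extract`, `transform`, and `load` are hypothetical, and an in-memory SQLite database stands in for the target warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source; the column names are illustrative only.
SOURCE_CSV = "id,amount\n1,10.50\n2,3.25\n"

def extract(text):
    """Read rows from a flat-file (CSV) source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Convert raw strings to the types the target table expects."""
    return [(int(r["id"]), float(r["amount"])) for r in rows]

def load(conn, records):
    """Insert the transformed records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()

# An in-memory database stands in for the target warehouse (assumption).
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

In this sketch the stages are chained in a single expression, so each stage begins only after the preceding stage has returned.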
Known ETL techniques typically take one of two approaches: (i) large volume, but low frequency; and (ii) high frequency, but low volume. In the first approach, large volumes of files are processed linearly in a batch window, typically after business hours. In the second approach (i.e., active or trickle loading), relatively small data volumes are continuously loaded in near real time throughout the day. Neither approach provides continuous, real-time, large-volume data storage. Indeed, ETL processes involve considerable complexity, which may lead to operational bottlenecks and speed limitations, especially in systems in which real-time access to such data is desirable. The range of data values or data quality in an operational system may exceed the capability of traditional ETL systems that rely on linear processing, where the extraction, transformation, and loading are performed in sequence. For example, the linear processing of traditional ETL systems may lack the speed required to process the data in real time. As the volume of data available for archiving grows, processing this information becomes increasingly challenging. In this regard, a common source of problems in ETL is a large number of dependencies among ETL tasks. For example, task 2 can only start after the completion of task 1, and task 3 can only start after the completion of task 2. Thus, when one stage slows down, all subsequent stages are delayed, having to wait until processing completes in the prior stage.
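The effect of such task dependencies may be illustrated with a short sketch. The function name `finish_times` and the task durations below are hypothetical and chosen only for illustration; the sketch assumes a strictly linear chain in which each task starts the moment its predecessor completes.

```python
def finish_times(durations):
    """Return the completion time of each task in a strictly linear chain."""
    finished = []
    clock = 0.0
    for duration in durations:
        # A task can start only after its predecessor has completed.
        clock += duration
        finished.append(clock)
    return finished

# Hypothetical durations (arbitrary time units) for tasks 1, 2, and 3.
normal = finish_times([1.0, 1.0, 1.0])
# Same chain, but task 2 takes two extra time units.
slowed = finish_times([1.0, 3.0, 1.0])

# Delay experienced by each task relative to the normal run.
delays = [s - n for s, n in zip(slowed, normal)]
```

Under these assumptions, task 1 is unaffected, while the slowdown in task 2 is carried over in full to task 3, even though task 3 itself runs no slower.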
Hence, a need exists for an archiving method and system that allows data to be loaded, queried, and archived on a single platform in a time- and cost-efficient way.