Due to the increased amounts of data being stored and processed today, operational databases are constructed, categorized, and formatted in a manner conducive to optimal throughput, access time, and storage capacity. Unfortunately, the raw data found in these operational databases often exists as rows and columns of numbers and codes that appear bewildering and incomprehensible to business analysts and decision makers. Furthermore, the scope and vastness of the raw data stored in modern databases make it harder to analyze. Hence, applications were developed in an effort to help interpret, analyze, and compile the data so that a business analyst may readily and easily understand it. This is accomplished by mapping, sorting, and summarizing the raw data before it is presented for display. Individuals can thereby interpret the data and make key decisions based thereon.
Extracting raw data from one or more operational databases and transforming it into useful information is the function of data “warehouses” and data “marts.” In data warehouses and data marts, the data is structured to satisfy decision support roles rather than operational needs. Before the data is loaded into the target data warehouse or data mart, the corresponding source data from an operational database is filtered to remove extraneous and erroneous records; cryptic and conflicting codes are resolved; raw data is translated into something more meaningful; and summary data that is useful for decision support, trend analysis, or other end-user needs is pre-calculated. In the end, the data warehouse comprises an analytical database containing data useful for decision support. A data mart is similar to a data warehouse, except that it contains a subset of corporate data for a single aspect of business, such as finance, sales, inventory, or human resources. With data warehouses and data marts, useful information is kept at the disposal of the decision makers.
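By way of illustration only, the filtering, code resolution, translation, and pre-calculation steps described above can be sketched in Python as follows. The record layout (`rgn`, `amt`), the `REGION_CODES` lookup table, and the function names are hypothetical and are chosen solely for this example; they are not taken from any particular data warehouse product.

```python
from collections import defaultdict

# Hypothetical lookup table that resolves cryptic and conflicting
# operational codes into one meaningful value per region.
REGION_CODES = {"01": "North", "N": "North", "02": "South"}


def extract(rows):
    """Yield raw rows from an operational source (here, an in-memory list)."""
    yield from rows


def transform(rows):
    """Filter erroneous records and translate codes into meaningful values."""
    for row in rows:
        amount = row.get("amt")
        region = REGION_CODES.get(row.get("rgn"))
        if amount is None or amount < 0 or region is None:
            continue  # drop extraneous or erroneous records
        yield {"region": region, "amount": amount}


def load(rows):
    """Pre-calculate per-region totals, standing in for the target data mart."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


raw = [
    {"rgn": "01", "amt": 100.0},
    {"rgn": "N", "amt": 50.0},    # conflicting code for the same region
    {"rgn": "02", "amt": 75.0},
    {"rgn": "99", "amt": 10.0},   # unknown code: filtered out
    {"rgn": "02", "amt": -5.0},   # negative amount: filtered out
    {"rgn": "01", "amt": None},   # missing amount: filtered out
]

summary = load(transform(extract(raw)))
print(summary)  # {'North': 150.0, 'South': 75.0}
```

In this sketch the summary totals are computed once at load time, so that a decision-support query reads the pre-calculated result rather than re-scanning the raw operational records.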
One major difficulty associated with implementing data warehouses and data marts is that a significant amount of processing time is required for performing data transport operations. Transport processes (data extraction, transformation, and loading) consume a significant amount of system resources. Unless these processes are scheduled to occur during specific time windows in which the operational databases are processing the minimum amount of transactional data, the performance of the operational databases is seriously compromised. Because the process of data transport slows down the operational databases, some organizations with recent data warehouse implementations leave only a small nightly window for the data transport process, such as from one to two in the morning.
Because of increasing demands for after-hours database usage and expanded operational hours, there is a need to further increase the throughput of the data transport process in order to assure that the data transport operation does not interfere with the operation of the operational database. Furthermore, in keeping with the proliferation of data mining software applications that capture the rich data patterns hidden inside data warehouses, some organizations might even require hourly refreshes. Thus, approaches to non-invasive data transport now focus on increasing the throughput of the data transport process, so that the whole process can be completed within the narrow time windows allowed. In other words, the pursuit of optimized throughput (i.e., speed) has begun.
To improve throughput, recent data warehouse application programs that perform data transport functions have relied on the use of multiple fast microprocessors. The use of multiple processors gives significantly improved processing speed and a corresponding increase in throughput. However, these prior art application programs use a single pipeline that includes multiple dependent process threads for performing extraction, transformation, and loading operations, and therefore do not take full advantage of the capabilities of the multiple processor environment. For example, delays in read operations slow down the entire process. Furthermore, because of the interdependencies between process threads within the single pipeline, delays affecting one microprocessor are propagated to all of the other processors, resulting in further delays. Thus, in spite of the use of increasingly powerful computers and the use of multiple microprocessors, data transport operations still consume an excessive amount of processing resources and processing time.
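The single-pipeline arrangement described above can be sketched, purely for illustration, as three dependent threads connected by bounded queues. This is an assumed minimal model and not the code of any prior art application; the names `reader`, `transformer`, `writer`, and `run_single_pipeline`, the doubling transformation, and the queue size of one are all hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the data stream


def reader(source, out_q):
    """Extraction stage: any delay here stalls every downstream stage."""
    for record in source:
        out_q.put(record)
    out_q.put(SENTINEL)


def transformer(in_q, out_q):
    """Transformation stage: blocks until the reader supplies a record."""
    while True:
        record = in_q.get()
        if record is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(record * 2)  # placeholder transformation


def writer(in_q, target):
    """Loading stage: blocks until the transformer supplies a record."""
    while True:
        record = in_q.get()
        if record is SENTINEL:
            break
        target.append(record)


def run_single_pipeline(source):
    """Run one pipeline whose stages depend on each other through queues."""
    q1 = queue.Queue(maxsize=1)
    q2 = queue.Queue(maxsize=1)
    target = []
    threads = [
        threading.Thread(target=reader, args=(source, q1)),
        threading.Thread(target=transformer, args=(q1, q2)),
        threading.Thread(target=writer, args=(q2, target)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return target


loaded = run_single_pipeline([1, 2, 3, 4])
print(loaded)  # [2, 4, 6, 8]
```

Because each stage waits on the queue fed by the stage before it, a slow read leaves the transformation and loading threads idle, regardless of how many processors are available to run them.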
What is needed is a method and apparatus for transporting data for data warehousing applications that increase throughput. In addition, a method and apparatus are required that meet the above need and that take full advantage of a multiple processor environment. The present invention provides a method and apparatus that meet the above needs.