Data warehousing, the collection and integration of large amounts of enterprise related information, is essential to generating reports for large enterprises. For example, analysts at an airline use such reports of ticket sales grouped by airport, day of week, and other criteria to determine which routes the airline should add or drop. Merchandizing analysts at a clothing company use reports of clothing sales by region and season to determine which stores need to be restocked with which items. These reports are based on data collected from many different sources, such as point-of-sale terminals, and aggregated into a large database, commonly called a “data warehouse.” Once the relevant data has been amassed within the data warehouse, reports summarizing the data can be generated for companies to use.
The sources that feed information into the data warehouse, however, tend to be quite different from one another or “heterogeneous,” and the process of delivering the information produced by a particular data source into the data warehouse is often specific to that particular data source. In small enterprises, it is often economically acceptable for the data to be loaded manually by an employee who formulates and executes the appropriate commands for converting the data into the proper format, moving the data into the proper location, and loading the data into the data warehouse. Larger enterprises, however, have so much data that an automated approach is more economically desirable. One common automated approach to loading data into a data warehouse is known as “job-based scheduling,” which involves scripting and mainframe technology that has been initially developed more than 30 or 40 years ago.
With job-based scheduling, an information technology (IT) engineer creates scripts that are executed on a periodic basis (e.g. nightly at 4:00 a.m.) and that control, at a detailed level, how the data is loaded, based on the particular requirements and formats of the incoming data and data warehousing. For example, the script may include the appropriate platform-specific commands for renaming a file containing the incoming data, causing the file to be moved to an appropriate location (e.g. directory or folder) on the computer system for the data warehousing, running specific data conversion routines to put the file of incoming data into a format the data warehouse is capable of processing, and invoking the data warehouse-specific command to cause the data warehouse to load the reformatted file.
Moreover, each data source has its own formatting and may be refreshed (i.e., become available with new data) at different times and periodicities than those of other data sources. For example, one data source can generate new information hourly, while another is daily. As a result, job-based scheduling script writers must be able to personally account for the differing data formats and handle the asynchronicity of the arrival of the data at different times.
Handling these issues, however, requires much skill and careful coding with complex methods, such as blocking jobs, parent job handlers, and so forth. As mainframe and other large computing system age, the numbers of people competent to craft such job-scheduled scripts diminishes with every year. The logic to handle these issues is hard-coded into the various scripts that are performed by the job-based, and upgrading or adding new data sources and data warehouses often require extensive recoding efforts and additional expense and time.
Some extraction, transformation, and loading (ETL) systems have been developed to address these complexities, but these ETL systems are often invasive, requiring tight integration both data sources and the data warehousing systems supplied by specific vendors and are difficult to extend to other installations or to be used with systems supplied by other vendors.