Users and organizations in the public and private sectors generate large amounts of data items and objects that are amenable to Extract, Transform and Load (ETL) transactions to process, understand and otherwise utilize the underlying information. ETL refers to a process in database usage, and especially in data warehousing, that “extracts” data from homogeneous or heterogeneous data sources, “transforms” the data into a specified or desired format or structure for querying and analysis purposes, and “loads” the transformed data onto a designated target destination, such as an operational data storage device (a “store”), a data warehouse, a data mart, etc. Data extraction may be time intensive, and accordingly some implementations execute all three ETL phases in parallel with respect to different data items, enabling resource and time efficiencies.
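The parallel execution of the three phases described above can be illustrated with a minimal sketch, assuming a thread per phase connected by bounded queues, so that one record is being loaded while another is transformed and a third is extracted. The function names, the record fields ("name", "amount") and the queue sizes are hypothetical choices made for illustration only and do not come from any particular ETL product.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the record stream


def extract(source, out_q):
    """Extract phase: read records from the source one at a time."""
    for record in source:
        out_q.put(record)
    out_q.put(_DONE)


def transform(in_q, out_q):
    """Transform phase: normalize each record into the target structure."""
    while True:
        record = in_q.get()
        if record is _DONE:
            out_q.put(_DONE)
            return
        out_q.put({
            "name": record["name"].strip().title(),
            "amount": float(record["amount"]),
        })


def load(in_q, target):
    """Load phase: append each transformed row to the target store."""
    while True:
        row = in_q.get()
        if row is _DONE:
            return
        target.append(row)


def run_pipeline(source):
    """Run all three phases concurrently on different data items."""
    extracted = queue.Queue(maxsize=2)    # bounded, so phases overlap
    transformed = queue.Queue(maxsize=2)
    target = []                           # stand-in for a warehouse table
    stages = [
        threading.Thread(target=extract, args=(source, extracted)),
        threading.Thread(target=transform, args=(extracted, transformed)),
        threading.Thread(target=load, args=(transformed, target)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return target
```

Because each phase is a single consumer of a first-in, first-out queue, rows reach the target in source order. A production system would typically replace the in-memory list with a database writer and add error handling, both of which are omitted here.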
ETL systems may integrate data from multiple, different applications or systems, which may be developed and supported by different entities or organizations, and hosted on separate computer hardware components and networks. Disparate systems containing original data may thereby be managed and operated by different users. For example, a cost accounting system may combine data from payroll, sales and purchasing systems.
ETL processes may experience a wide variety of workload demands, each calling for different amounts and types of computing resources. Problems arise in efficiently deploying cloud models to meet variable workload demands, wherein the resources needed to execute large job runs may be easily procured, and then altered (stopped, etc.) as needed for other, smaller job runs.
Prior art ETL workload management techniques generally require operators to revise job configurations and scheduling to allocate appropriate resources manually, or through some fixed programmatic method such as a scripted scenario, where additional servers are provisioned through an API before job execution, and then de-provisioned after execution. One prior art approach is discussed by “Exploiting Time-Malleability in Cloud-Based Batch Processing Systems” (Luo Mai, Evangelia Kalyvianaki, and Paolo Costa, Workshop on Large-Scale Distributed Systems and Middleware (LADIS'13), ACM, November 2013), wherein the scheduling of jobs that are “time-malleable” is changed to correspond to times of greater resource availability as a function of a pricing model wherein the later a job is completed, the lower the rate a user pays. To avoid unbounded completion times, users may also specify the longest acceptable deadline for completion of the jobs and a maximum price they are willing to pay.
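The fixed, scripted scenario mentioned above, in which servers are provisioned before a job and de-provisioned afterward, might look like the following minimal sketch. FakeCloudClient and its provision and deprovision methods are hypothetical stand-ins for a provider API and only record calls; no real cloud service is assumed. The point of the sketch is that the server count is fixed by the script rather than derived from the workload.

```python
from contextlib import contextmanager


class FakeCloudClient:
    """Hypothetical stand-in for a cloud provider API; records calls only."""

    def __init__(self):
        self.active = 0   # number of additional servers currently held
        self.log = []     # ordered record of API calls

    def provision(self, count):
        self.active += count
        self.log.append(("provision", count))

    def deprovision(self, count):
        self.active -= count
        self.log.append(("deprovision", count))


@contextmanager
def fixed_servers(client, count):
    """Hold a fixed number of extra servers for the duration of a block."""
    client.provision(count)
    try:
        yield
    finally:
        # Release the servers even if the job raises an exception.
        client.deprovision(count)


def run_job(client, job, extra_servers):
    """Scripted run: provision, execute the job, then de-provision."""
    with fixed_servers(client, extra_servers):
        return job()
```

In this arrangement an operator must choose extra_servers in advance for every job, which is the manual or fixed allocation step that the paragraph above attributes to prior art techniques.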
“An Optimization Scheme for Bank Batch Processing Based on Cloud Computation” (X. Zhao, G. M. Li, Applied Mechanics and Materials, Vol. 539, pp. 339-344, July 2014) teaches batch processing optimization schemes that divide a business process job into parallel and independent tasks to which IT resources are differentially allocated.
“Resource Aware Workload Management for Autonomic Database Management Systems” (Wendy Powley, Patrick Martin, Natalie Gruska, Paul Bird and David Kalmuk, ICAS 2014: The Tenth International Conference on Autonomic and Autonomous Systems, IARIA, April 2014) teaches a “resource aware” scheduling approach that schedules queries to run only when doing so is unlikely to overwhelm identified CPU, I/O and sort heap memory resources.
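The general idea of resource aware admission can be sketched as follows. This is a simplified illustration and not the algorithm of the cited paper: it assumes each query arrives with a known estimate of its CPU, I/O and sort heap demand, and it uses a hard capacity threshold in place of any likelihood estimate. The Demand type, the fits and admit functions, and the units are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demand:
    """Estimated resource use, in arbitrary illustrative units."""
    cpu: float
    io: float
    sort_heap: float


def fits(demand, in_use, capacity):
    """True if adding the demand keeps every resource within capacity."""
    return (
        in_use.cpu + demand.cpu <= capacity.cpu
        and in_use.io + demand.io <= capacity.io
        and in_use.sort_heap + demand.sort_heap <= capacity.sort_heap
    )


def admit(pending, in_use, capacity):
    """Split pending (name, Demand) pairs into admitted and deferred names."""
    admitted, deferred = [], []
    for name, demand in pending:
        if fits(demand, in_use, capacity):
            admitted.append(name)
            # Account for the admitted query before checking the next one.
            in_use = Demand(
                in_use.cpu + demand.cpu,
                in_use.io + demand.io,
                in_use.sort_heap + demand.sort_heap,
            )
        else:
            deferred.append(name)
    return admitted, deferred
```

Note that a scheduler of this kind only decides when queries run against a fixed pool of resources; it does not change the size of the pool itself.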
However, changing job schedules and parameters in order to conform to known cloud environments does not indicate or enable the revision of cloud resource deployments. This results in inefficiencies and high execution costs, as job scheduling and reallocations may not have a significant impact on resource efficiency and cost savings in all cloud resource configurations.