Workload Management (WLM) systems help control access to machine resources. Typically, a WLM system consists of monitors that track the usage of the managed resources, work queues that hold workloads that cannot run immediately, and policies that determine which workload should run next. WLM systems may be implemented at low levels in a software stack, e.g., at an Operating System (OS) level or lower.
Data integration may be described as extracting data from a source, transforming the data, and loading the data to a target. That is, data integration is Extract, Transform, Load (ETL) processing. Data integration processing engines may be scalable and capable of processing large volumes of data in complex data integration projects. It is common for multiple users (e.g., customers) and projects to share a single data integration processing engine that is responsible for handling all of the data integration processing for those multiple users. This high volume, highly concurrent processing may be resource intensive, and users try to balance the availability of system resources with the need to process large volumes of data efficiently and concurrently.
Due to the complexity of these environments, some data integration execution environments need application level workload management functionality, rather than low level (e.g., OS level) functionality. The resources that need to be managed may be application resources, which are a form of logical resources, as opposed to system resources (e.g., Central Processing Unit (CPU), memory, storage, etc.).
A WLM system may be designed to manage the number of workloads that are running concurrently, indirectly managing the machine resources required to run the workloads. The WLM system may also manage the number of workloads that are allowed to start in a given time window. These two aspects may be described as application resources that the WLM system is managing.
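The two limits described above can be illustrated with a minimal sketch. The class name `AdmissionPolicy`, its parameters, and the sliding-window bookkeeping are assumptions made for illustration only; they are not taken from any particular WLM system.

```python
from collections import deque


class AdmissionPolicy:
    """Hypothetical policy: caps running workloads and starts per time window."""

    def __init__(self, max_running, max_starts, window_seconds):
        self.max_running = max_running        # limit on concurrent workloads
        self.max_starts = max_starts          # limit on starts within one window
        self.window_seconds = window_seconds  # length of the sliding window
        self.running = 0
        self.start_times = deque()            # start timestamps still in the window

    def try_start(self, now):
        """Return True and record the start if both limits allow it."""
        # Forget starts that have aged out of the sliding window.
        while self.start_times and now - self.start_times[0] >= self.window_seconds:
            self.start_times.popleft()
        if self.running >= self.max_running:
            return False  # too many workloads are already running
        if len(self.start_times) >= self.max_starts:
            return False  # too many workloads started recently
        self.running += 1
        self.start_times.append(now)
        return True

    def finish(self):
        """Record that one running workload has completed."""
        if self.running > 0:
            self.running -= 1
```

In this sketch, a workload that `try_start` refuses would be placed in a work queue and retried later by the WLM system; neither limit refers to CPU, memory, or storage directly.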
While managing application level resources, a WLM system itself uses some logical/application level resources in order to track and manage workloads, and these may or may not be among the resources that it is intended to manage. For example, the WLM system may prepare and hold incoming workloads, which may consume one or more application resources (while not placing much additional load on physical machine resources).
That is, the WLM system uses some resources to queue workloads for execution. It is possible that there are physical machine resources available to use, but the application's WLM system does not have resources to manage any more workloads. For example, an application may have a logical resource (“slot”) for handling 100 items that are executing, which exceeds what the physical machine can actually execute concurrently. With a WLM system in place, workloads may be queued up, which takes little additional machine resource, but may consume a logical resource, i.e., a “slot”. When enough of these workloads are queued up, this logical resource may be exhausted.
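A minimal sketch of this slot exhaustion follows. The names `SlotBackedWlm` and `SlotExhausted`, and the limit of four running workloads, are hypothetical; raising an exception is only one possible response to exhaustion.

```python
class SlotExhausted(Exception):
    """Raised when no logical slot is left for a new workload."""


class SlotBackedWlm:
    """Hypothetical WLM in which every running or queued workload holds one slot."""

    def __init__(self, total_slots, max_running):
        self.total_slots = total_slots  # logical resource, e.g. 100 slots
        self.max_running = max_running  # what the machine can run concurrently
        self.running = []
        self.queued = []

    def slots_in_use(self):
        # Queued workloads cost little machine resource but still hold a slot.
        return len(self.running) + len(self.queued)

    def submit(self, workload_id):
        if self.slots_in_use() >= self.total_slots:
            raise SlotExhausted(f"no slot available for {workload_id}")
        if len(self.running) < self.max_running:
            self.running.append(workload_id)
            return "running"
        self.queued.append(workload_id)
        return "queued"


wlm = SlotBackedWlm(total_slots=100, max_running=4)
states = [wlm.submit(f"job-{i}") for i in range(100)]
# At this point 4 workloads are running and 96 are queued, so all 100 slots
# are held and the next submit() call raises SlotExhausted.
```

In this sketch the logical resource runs out because of the 96 queued workloads, independently of how much CPU or memory the four running workloads actually consume.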
Although there are machine resources available, when logical resources are exhausted, current WLM systems may become unstable and/or fail with unexpected error conditions.
In some cases, the WLM system reaches the state that no more workloads can be prepared, and the WLM rejects the workload outright. Then, the client may re-submit the workload at a later time.
In some other cases, the client automatically tries to re-submit the workload until the WLM system accepts that workload. In such cases, it may be difficult to determine the frequency of re-submissions. Also, if multiple clients are trying to re-submit workloads, then any one of the clients may have its workload accepted by the WLM system, regardless of when or how often that client has re-submitted its workload. This can lead to situations where one submitted workload can wait for hours and not get into the WLM system, while another workload waits only seconds and takes a newly available slot. Moreover, the state of the workload is undefined in that the WLM system knows nothing about that workload until the workload is accepted by the WLM system. So, the WLM system will not report this workload as existing in a queue, and the WLM system does not utilize any of the WLM queue management functions on the workload (moving up in the queue, switching queues, cancelling the workload).
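The unfairness of uncoordinated re-submission can be illustrated with a small deterministic sketch. It assumes, purely for illustration, that each client retries at a fixed interval and that a freed slot goes to whichever client retries first; the function names and the timings are hypothetical.

```python
def first_retry_at_or_after(first_attempt, interval, free_time):
    """Time of a client's first attempt at or after the slot becomes free."""
    if free_time <= first_attempt:
        return first_attempt
    elapsed = free_time - first_attempt
    retries = -(-elapsed // interval)  # ceiling division on integers
    return first_attempt + retries * interval


def winner(clients, free_time):
    """Name of the client whose attempt lands first once the slot is free."""
    return min(
        clients,
        key=lambda name: first_retry_at_or_after(*clients[name], free_time),
    )


# name -> (time of first attempt in seconds, retry interval in seconds)
clients = {
    "A": (0, 600),    # has been retrying every 10 minutes for about 2 hours
    "B": (7195, 10),  # arrived seconds ago and retries every 10 seconds
}
slot_free_at = 7201

print(winner(clients, slot_free_at))  # prints "B"
```

Here client A next retries at 7800 seconds while client B retries at 7205 seconds, so B takes the slot after waiting 10 seconds even though A has been waiting for roughly two hours. Because the WLM system does not see either workload until one is accepted, it has no basis for ordering them by arrival time.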
Also, client re-submissions may be organized further by creating a centralized queuing mechanism that is not managed by the WLM system.