Computer users requiring exceptional reliability, redundancy or security, such as very large corporations—and particularly financial sector corporations such as banks, exchanges, brokerages and the like—will often outsource computing needs to third party providers. The preeminent example of such a provider is the International Business Machines (IBM) corporation, which has several thousand users who pay a premium for the capability and reliability of its System z (“z” standing for “zero downtime”) computing platform.
System z users have the benefit of multiple redundant mainframe computers that will continue to seamlessly execute users' workload despite the failure of individual machines. Each group of related computing functions being performed for a user is referred to as a logical partition (LPAR), which is executed by a given machine called a central electronic complex (CEC). The user can set usage limits for LPARs and for groups of LPARs. The present inventors have previously developed improved systems and methods for managing LPAR capacity limits to enhance system performance and control billable costs. An example of such systems and methods can be seen in U.S. Non-provisional patent application Ser. No. 14/199,364, filed on Mar. 6, 2014, the contents of which are herein incorporated by reference in their entirety.
In connection with assigning computing workload to LPARs, users define “service classes.” When defining a service class, a user specifies a workload importance level for the workload to be performed therein, as well as a performance goal. In the System z context, there are seven importance levels ranging from 0 (most important) through 6 (least important, also called “Discretionary”). The performance goal is defined in terms of certain performance parameters, such as a percentage of operations completed within a given time. An example of a defined performance goal would be 90% of transactions finishing within 0.01 seconds of clock time.
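For illustration only, the service class definition described above can be modeled as a small data structure. The following Python sketch assumes a single percentile response-time goal per service class (such as 90% of transactions within 0.01 seconds); the names ServiceClass, PercentileGoal and goal_met are hypothetical and are not part of z/OS or the WLM.

```python
from dataclasses import dataclass

# Importance levels as described above: 0 is most important,
# 6 ("Discretionary") is least important.
MOST_IMPORTANT = 0
DISCRETIONARY = 6


@dataclass(frozen=True)
class PercentileGoal:
    """Goal of the form 'pct percent of transactions finish within limit_s seconds'."""
    pct: float      # e.g. 90.0
    limit_s: float  # e.g. 0.01


@dataclass(frozen=True)
class ServiceClass:
    name: str
    importance: int
    goal: PercentileGoal

    def __post_init__(self) -> None:
        # Reject importance levels outside the 0..6 range.
        if not MOST_IMPORTANT <= self.importance <= DISCRETIONARY:
            raise ValueError(f"importance must be in 0..6, got {self.importance}")


def goal_met(goal: PercentileGoal, response_times_s: list[float]) -> bool:
    """Return True if at least goal.pct percent of observed times are within goal.limit_s."""
    if not response_times_s:
        return False
    within = sum(1 for t in response_times_s if t <= goal.limit_s)
    return 100.0 * within / len(response_times_s) >= goal.pct
```

For example, ServiceClass("ONLINE", 1, PercentileGoal(90.0, 0.01)) would describe a hypothetical importance-1 class whose goal is met when nine of every ten sampled transactions finish within 0.01 seconds.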
To allow further flexibility, a service class can include multiple divisions called “periods,” assigned to different importance levels and having different defined performance goals. When workload is introduced into a multi-period service class, it automatically starts in the period with the highest importance level. If the workload exceeds a defined usage limit of the period in which it is currently running, it is automatically transferred into the period having the next highest importance level. The usage limit is defined in terms of a usage parameter, such as time, processor cycles or the like. In general, multi-period service classes are used to allow shorter-running workload to pass more quickly through the system without being unduly delayed by longer-running workload assigned to the same service class.
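The period-to-period progression can likewise be sketched. This sketch assumes that each period's usage limit is counted only against usage accrued while in that period, so that the boundary between periods is the running sum of the earlier periods' limits. The Period type, the current_period function and the abstract usage unit are hypothetical, and the code is not a description of z/OS internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Period:
    importance: int
    # Usage limit for this period in a hypothetical abstract usage unit
    # (e.g. CPU time or processor cycles); None means unlimited (last period).
    usage_limit: Optional[float]


def current_period(periods: list[Period], usage: float) -> int:
    """Return the 0-based index of the period a unit of work should run in.

    Work starts in the first (most important) period and moves to the next
    period once its accumulated usage exceeds the summed limits of the
    periods it has already passed through.
    """
    if not periods:
        raise ValueError("a service class needs at least one period")
    boundary = 0.0
    for index, period in enumerate(periods):
        if period.usage_limit is None:
            return index
        boundary += period.usage_limit
        if usage <= boundary:
            return index
    # Work that outlives every defined limit stays in the last period.
    return len(periods) - 1
```

With periods of limits 500 and 2000 followed by an unlimited third period, work that has consumed 501 units runs in the second period, and work that has consumed 2501 units runs in the third.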
The System z operating system (z/OS) includes a Workload Manager (WLM) for each LPAR, which manages service class workload within the LPAR based on importance level and also monitors achievement of the defined performance goal. A performance index (PI) is measured for each defined performance goal by z/OS based on the performance parameters in terms of which the goal is defined. A PI of 1.0 indicates that a given defined performance goal is being exactly met, although a range of 0.8 to 1.2 is generally used as an indicator of satisfactory goal achievement, with PI values under 0.8 indicating overachievement (i.e., the performance goal is exceeded) and values over 1.2 indicating underachievement (i.e., the performance goal is not achieved).
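The interpretation of PI values can be expressed as a simple classification. In the sketch below, the PI for a response-time goal is approximated, as an assumption, by dividing the response time actually achieved at the goal's percentile by the goal time, so that values above 1.0 mean the goal is being missed; z/OS's own PI measurement is not reproduced here. The function and enum names are hypothetical.

```python
from enum import Enum


class GoalStatus(Enum):
    OVERACHIEVING = "overachieving"
    SATISFACTORY = "satisfactory"
    UNDERACHIEVING = "underachieving"


# Band described in the text: 0.8 <= PI <= 1.2 is generally treated as satisfactory.
LOW_PI = 0.8
HIGH_PI = 1.2


def response_time_pi(actual_s: float, goal_s: float) -> float:
    """Illustrative PI: achieved response time at the goal percentile / goal time."""
    if goal_s <= 0:
        raise ValueError("goal time must be positive")
    return actual_s / goal_s


def classify(pi: float) -> GoalStatus:
    """Map a PI value onto the over/satisfactory/under bands."""
    if pi < LOW_PI:
        return GoalStatus.OVERACHIEVING
    if pi > HIGH_PI:
        return GoalStatus.UNDERACHIEVING
    return GoalStatus.SATISFACTORY
```

Under this assumption, a class whose 90th-percentile response time is 0.005 seconds against a 0.01-second goal would have a PI of 0.5 and be classified as overachieving.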
Referring to FIG. 1, a chart graphically illustrates the relationship between service classes and WLM importance levels. As can be seen, some of the service classes have multiple periods (e.g., the service classes DDFPROD and DDFTEST). While it is common for a multi-period service class to have only two periods, a service class could include more than two periods. Each service class or period thereof has a defined performance goal, whose achievement the WLM monitors based on the PI.
Significantly, when an LPAR is capacity-limited, the WLM will allocate capacity between service classes (and periods thereof) based upon the PI. In the case of overachievement, the WLM will reduce the capacity allocated to the overachieving service class or period in favor of a service class or period with a PI indicating underachievement. In the case of a service class/period that is experiencing continuous underachievement in a capacity-limited situation, the WLM is configured to stop allocating more capacity thereto. The logic underlying this configuration is that the defined performance goal of the service class/period simply cannot be achieved with a reasonable allocation of capacity.
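A toy model of the capacity-limited behavior described above follows. It is a simplified sketch under stated assumptions, not the actual WLM algorithm: capacity is expressed in hypothetical units, a fixed step is moved per pass, a service class or period is treated as “given up on” once it has underachieved for a hypothetical number of consecutive intervals, and importance levels are ignored when choosing which overachiever gives up capacity.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    pi: float
    capacity: float                    # share of the LPAR's limited capacity (hypothetical units)
    underachieving_intervals: int = 0  # consecutive intervals with PI > 1.2


def rebalance(workloads: list[Workload], step: float, give_up_after: int) -> None:
    """One illustrative rebalancing pass in a capacity-limited LPAR.

    Each eligible underachiever (PI > 1.2) takes up to `step` units from the
    first overachiever (PI < 0.8) that still has capacity. Underachievers that
    have missed their goal for `give_up_after` or more consecutive intervals
    receive nothing. Total capacity is conserved.
    """
    donors = [w for w in workloads if w.pi < 0.8]
    receivers = [w for w in workloads
                 if w.pi > 1.2 and w.underachieving_intervals < give_up_after]
    for receiver in receivers:
        for donor in donors:
            moved = min(step, donor.capacity)
            if moved > 0:
                donor.capacity -= moved
                receiver.capacity += moved
                break  # one donor per receiver per pass
```

Because donor selection in this toy ignores importance, an overachieving high-importance workload can lose capacity to a less important one, and a class past the give-up threshold receives nothing further no matter how high its PI rises.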
A performance goal is normally defined by a user when a service class is created. While a user could manually change the defined performance goals later, this is rarely done. Although the WLM changes allocated capacity based on the PI, it never changes the defined performance goal. Sub-optimal goal definitions can lead to undesirable results. For instance, the overachievement case described above can effectively result in higher importance workload being slowed down in favor of less important workload in another service class/period. The persistent underachievement case can effectively result in the WLM “giving up” on the affected service class/period.
While features like service class definitions and the WLM importance levels allow billed computer system users some flexibility to manage workload performance on LPARs, further improvements are possible.