1. Field
The disclosure relates generally to scheduling resources by a runtime environment, and more particularly to informing the runtime environment, by an attribute of a descriptor, of which instructions to run to schedule a plurality of resources for completion of a task in accordance with a level of quality of service in a service level agreement.
2. Description of the Related Art
Customers may contract with one or more providers for computer services. In other cases, customers may also have in-house computers which they may manage based on quality of service needs. The computer services entail data processing by multiple machines over one or more networks. Indeed, the multiple machines may involve tens of thousands of processors performing calculations for thousands of customers, each customer having a plurality of accounts, and each account having one or more applications necessary to perform various data processing requirements. In addition to multiple applications within each account, there can be multiple users within an account. In many cases, each account has different resiliency concerns depending on the application and the user.
Resiliency concerns differ for a number of reasons. First, some errors have extreme consequences while other errors have no real consequence. An example of an extreme consequence may be an error in a flight control system. An example of a data error without a significant consequence may be an error in copying a video file.
Whether catastrophic or inconsequential, data processing errors must be managed appropriately. Data processing errors occur at runtime. Runtime refers to the time during which one or more programs are run on one or more resources. Management of data processing errors accepts the fact that integrated circuits occasionally fail or produce incorrect data at runtime. Failures and incorrect data can be controlled and corrected to the degree that computing resources are committed to such control by a runtime environment. The cost of committing computing resources to the control of errors increases with the complexity of ever smaller integrated circuit designs and with the complexity of systems connecting vast numbers of machines and applications in virtual or cloud computing environments. Soft errors, due to transient particles, and hard errors, due to equipment failure, result in incorrect running, data integrity problems, and machine stops. While a particular piece of hardware can be designed to stringent specifications, the resiliency of that particular piece of hardware is affected by other devices that may be attached to it directly or through a network by a runtime environment. The attached devices may have been built to vastly different resiliency standards. Moreover, data flows between hardware through input/output adapters can also result in errors.
Perhaps the most common method for avoiding the hard errors resulting from equipment failure, and for detecting soft errors due to transient particles, involves running two or more copies of an application either on the same hardware or on different hardware. The results of the computation of the two copies are compared frequently. When the results of the computation of the two copies do not match, an error is detected. Resiliency can be increased further by adding more redundancy. On the one hand, redundancy enables detection and correction of errors. On the other hand, redundancy involves additional cost in resources, performance, and power consumption. For example, when an application is run twice, memory and processing resource demands increase. The increase in memory and processing resource demands translates into higher costs due to power consumption and the time an account is billed for using resources.
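By way of illustration only, the redundancy approach described above may be sketched in Python as follows. The sketch assumes that each copy of the application can be modeled as a callable returning a hashable result. The names `run_redundant`, `vote`, and `MismatchError` are hypothetical and are not taken from the disclosure. With two copies a mismatch can only be detected, whereas with three or more copies a strict majority may also be used to correct the error, at the cost of running the task additional times.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


class MismatchError(Exception):
    """Raised when redundant copies disagree and no strict majority exists."""


def vote(results: Sequence[Hashable]) -> Hashable:
    """Compare the results of redundant copies (hypothetical helper).

    Returns the agreed value if all copies match, or the strict-majority
    value if one exists. Otherwise the error is detected but cannot be
    corrected, and MismatchError is raised.
    """
    if not results:
        raise ValueError("at least one result is required")
    value, count = Counter(results).most_common(1)[0]
    if count > len(results) // 2:
        # All copies agree, or a strict majority outvotes the faulty copy.
        return value
    raise MismatchError(f"copies disagree with no majority: {list(results)}")


def run_redundant(task: Callable[[], Hashable], copies: int = 2) -> Hashable:
    """Run `task` `copies` times and compare the results.

    Assumption: the copies run sequentially here for simplicity; in
    practice they could run on the same or on different hardware.
    """
    if copies < 2:
        raise ValueError("redundancy requires at least two copies")
    results = [task() for _ in range(copies)]
    return vote(results)
```

For example, `vote([4, 5])` raises `MismatchError` because two copies can only detect the disagreement, while `vote([4, 5, 4])` returns `4` because the third copy supplies a majority. Each additional copy multiplies the processing demand accordingly, which reflects the cost trade-off noted above.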
Accordingly, there is a need for a method and apparatus that take into account one or more of the issues discussed above, as well as other possible issues.