General purpose microprocessors are designed to support a wide range of workloads and applications, usually by performing tasks in software. If processing power beyond existing capabilities is required, hardware accelerators may be integrated into a computer system to meet the requirements of a particular application.
Hardware accelerators may perform certain tasks more efficiently than processors running a software routine. One aspect of hardware acceleration is that algorithmic operations are performed on data using specially designed hardware rather than generic hardware, as is the case with software running on a microprocessor. A hardware accelerator can be any hardware that is designed to perform specific algorithmic operations on data. In this regard, hardware accelerators generally perform a specific task to offload CPU (software) cycles. This is accomplished by transferring the data that requires processing into the domain of the hardware accelerator (usually part or all of a chip or a circuit board assembly), performing the hardware accelerated processing on that data, and then transferring the resultant data back to the software domain.
Examples of hardware accelerators include the IBM Cell B.E. (Broadband Engine) processor, encryption units, compression/decompression engines and graphics processing units (GPUs). Hardware accelerators may be programmable to enable specialization of a particular task or function and may include a combination of software, hardware, and firmware. Hardware accelerators may be attached directly to the processor complex or nest, via PCI Express (Peripheral Component Interconnect Express) I/O (input/output) slots, or remotely via high-speed networks.
Hardware accelerators may be implemented in separate integrated circuits, including FPGAs (Field Programmable Gate Arrays), and connected via a bus to a general purpose microprocessor. Multiple co-processors serving as hardware accelerators may be instantiated on the same die as the processor or as part of a multi-chip module (MCM), as in the case of IBM's Power series mainframe systems.
Typical uses of hardware accelerators may include compression and decompression of memory pages to conserve overall memory usage. If a block of data residing in memory has not been recently used and main memory space is limited, compressing the block can reduce the address space necessary for storage, and when the same data is needed for subsequent processing it can be recalled and decompressed. Having a dedicated hardware accelerator perform this function relieves the general purpose processor of the task, so that it can continue executing other processing functions. It also allows the compression and decompression operations to run at higher throughput and maximizes efficient utilization of finite memory resources.
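By way of illustration only, the page compression scheme described above can be sketched in software. In the following Python sketch, the standard zlib library stands in for a hardware compression engine, and the names PageStore, compress_cold and bytes_in_use, as well as the 4096-byte page size, are hypothetical and chosen only for this example.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes, for illustration only


class PageStore:
    """Toy page store; zlib stands in for a compression accelerator."""

    def __init__(self):
        self.pages = {}       # page id -> uncompressed bytes
        self.compressed = {}  # page id -> compressed bytes

    def write(self, page_id, data):
        # A fresh write replaces any compressed copy of the page.
        self.compressed.pop(page_id, None)
        self.pages[page_id] = data

    def compress_cold(self, page_id):
        # Compress a page that has not been used recently.
        data = self.pages.pop(page_id)
        self.compressed[page_id] = zlib.compress(data)

    def read(self, page_id):
        # Decompress on demand when the page is needed again.
        if page_id in self.compressed:
            packed = self.compressed.pop(page_id)
            self.pages[page_id] = zlib.decompress(packed)
        return self.pages[page_id]

    def bytes_in_use(self):
        plain = sum(len(p) for p in self.pages.values())
        packed = sum(len(c) for c in self.compressed.values())
        return plain + packed
```

In a system with a dedicated accelerator, the calls to zlib would instead be requests issued to the compression engine, leaving the general purpose processor free for other work.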
Similarly, when encrypted data is received from an I/O device for processing, encryption/decryption engines enable analysis of the received data to proceed more efficiently, which can speed timely analysis of, for example, financial or telemetry data. In this regard, accelerators may aid processing merely by converting data into formats compatible with a certain application or protocol. Offloading this function from the main processor relieves processing bottlenecks associated with such tasks.
Management of a diverse pool of processing resources may be accomplished through high level controllers known as hypervisors or virtual machine managers (VMM). These implement hardware virtualization techniques allowing multiple operating systems to run concurrently on a host computer. The hypervisor provides a virtual operating platform and manages the execution of the guest operating systems and applications. Multiple instances of a variety of operating systems may share the virtualized hardware resources. Hypervisors are installed on server hardware whose only task is to run guest operating systems. Non-hypervisor virtualization systems are used for similar tasks on dedicated server hardware, but also commonly on desktop, portable and even handheld computers.
Logical partitioning (LPAR) allows hardware resources to be shared by means of virtualization among multiple guest operating systems. One guest operating system comprises one LPAR. Two LPARs may access memory from a common memory chip, provided that the ranges of addresses directly accessible to each do not overlap. One partition may indirectly control memory controlled by a second partition, but only by commanding a process in that partition. CPUs may be dedicated to a single LPAR or shared. On IBM mainframes, LPARs are managed by the hypervisor. IBM mainframes operate exclusively in LPAR mode, even when there is only one partition on a machine. Multiple LPARs can run on one machine or be spread across multiple machines.
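The condition that the address ranges directly accessible to each partition do not overlap can be expressed as a simple check. The following Python sketch is illustrative only: the names Range, overlaps and partitions_isolated are hypothetical, and address ranges are assumed to be half-open intervals.

```python
from itertools import combinations
from typing import NamedTuple


class Range(NamedTuple):
    start: int  # first address in the range (inclusive)
    end: int    # one past the last address (exclusive)


def overlaps(a, b):
    # Two half-open ranges overlap when each starts before the other ends.
    return a.start < b.end and b.start < a.end


def partitions_isolated(assignments):
    """Return True if no two partitions share a directly accessible address.

    `assignments` maps a partition name to a list of Range objects.
    """
    for (_, ranges_a), (_, ranges_b) in combinations(assignments.items(), 2):
        for a in ranges_a:
            for b in ranges_b:
                if overlaps(a, b):
                    return False
    return True
```

For example, two partitions assigned the ranges 0x0000 to 0x4000 and 0x4000 to 0x8000 of the same memory chip would pass this check, whereas ranges 0x0000 to 0x5000 and 0x4000 to 0x8000 would not.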
Efficient utilization of a finite number of hardware accelerators requires a queue management system to prioritize processing jobs and ensure fairness in allocating available processing acceleration resources amongst the LPARs. Computer systems must accommodate scheduling, dispatch, execution and perhaps termination of a wide variety of processing jobs with different execution latencies and vastly different memory constraints. High priority applications, even those with predictable processing requirements, may demand a disproportionately large share of processing resources, thereby inhibiting completion of lower priority jobs—perhaps indefinitely—because a higher priority job may always take precedence over a lower priority job. To prevent a high bandwidth job from completely dominating acceleration resources, a fairness protocol is needed to ensure lower priority jobs are executed within an acceptable period of latency.
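The starvation problem described above, and one conventional way of addressing it, can be sketched as follows. This Python sketch uses priority aging, a well known scheduling technique offered here purely as an example of a fairness protocol and not as the scheme of any particular system. The names Job and AgingQueue and the aging_rate parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int     # lower number means higher priority
    enqueued_at: int  # tick at which the job entered the queue


class AgingQueue:
    """Priority queue in which waiting jobs gradually gain priority."""

    def __init__(self, aging_rate=1):
        self.jobs = []
        # Priority levels gained per tick spent waiting (assumed policy).
        # A rate of 0 gives strict priority scheduling with no fairness.
        self.aging_rate = aging_rate

    def submit(self, job):
        self.jobs.append(job)

    def dispatch(self, now):
        # Effective priority is the base priority minus a waiting credit,
        # so a long-waiting job eventually outranks newer arrivals.
        def effective(job):
            return job.priority - self.aging_rate * (now - job.enqueued_at)

        best = min(self.jobs, key=effective)
        self.jobs.remove(best)
        return best
```

With aging_rate set to 0, a low priority job is never dispatched while a higher priority job arrives on every tick. With a positive aging_rate, the waiting time of the low priority job eventually offsets its lower base priority, which bounds its latency.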
Even in computer systems employing hardware acceleration, co-processing resources are limited and must be carefully managed to meet expected throughput requirements of all applications running on the system. In this regard, processing latency would be improved by a queue management scheme capable of dynamically configuring available hardware acceleration queues so that processing jobs may be assigned to queues based on usage, job latency and capacity.