Generally, example embodiments of the present disclosure relate to hardware accelerators, and more particularly to providing a method, system, and computer program product for streaming attachment of hardware accelerators to computing systems.
General-purpose processors such as Intel®, AMD®, and IBM POWER® processors are designed to support a wide range of workloads. If processing power beyond existing capabilities is required, hardware accelerators may be attached to a computer system to meet the requirements of a particular application. Examples of hardware accelerators include FPGAs (Field Programmable Gate Arrays), the IBM Cell B.E. (broadband engine) processor, and graphics processing units (GPUs). Hardware accelerators are typically programmable, which allows a hardware accelerator to be specialized to a particular task or function, and consist of a combination of software, hardware, and firmware. Such hardware accelerators may be attached directly to the processor complex or nest, through PCI-express (peripheral component interconnect) IO (input-output) slots, or over high-speed networks, for example, Ethernet and Infiniband®.
Call-return programming models are typically used for accelerator attachment to high-end computing systems. In the call-return programming model, a processing unit (PU) may make a call to an accelerator with task blocks (task descriptors), parameters, and/or input data blocks. The PU may wait until a reply or result is received. An accelerator run-time system on the PU usually generates a task block for a given input block of data directed to the accelerator. The task block and data may then be passed to the accelerator for processing. This approach works well if the input data block size is bounded. For stream processing, however, the length of the stream may be unknown, and creating a task block for every byte of data to be processed on an accelerator may be prohibitive and may create undue overhead. Therefore, call-return programming models are inefficient for streaming data.
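The call-return model described above may be illustrated with a minimal sketch. The names TaskBlock, Accelerator, and call_return_process, and the checksum operation, are hypothetical and chosen only for illustration; they do not correspond to any particular accelerator run-time system.

```python
from dataclasses import dataclass


@dataclass
class TaskBlock:
    """Hypothetical task descriptor generated by the run-time system on the PU."""
    task_id: int
    operation: str
    length: int


class Accelerator:
    """Stand-in for an attached accelerator; counts the task blocks it receives."""

    def __init__(self):
        self.task_blocks_received = 0

    def call(self, task, data):
        # One call per task block; the caller waits for the returned result.
        self.task_blocks_received += 1
        if task.operation == "checksum":
            return sum(data) % 256
        raise ValueError(f"unsupported operation: {task.operation}")


def call_return_process(accelerator, blocks):
    """Generate one task block per input block and wait for each reply."""
    results = []
    for i, block in enumerate(blocks):
        task = TaskBlock(task_id=i, operation="checksum", length=len(block))
        results.append(accelerator.call(task, block))
    return results


# Bounded input blocks: two blocks require two task blocks.
bounded = Accelerator()
bounded_results = call_return_process(bounded, [b"\x01\x02", b"\x03"])

# Byte-at-a-time input: ten bytes require ten task blocks.
per_byte = Accelerator()
per_byte_results = call_return_process(per_byte, [bytes([b]) for b in range(10)])
```

In this sketch the number of task blocks grows with the number of input blocks, which is the overhead the preceding paragraph attributes to call-return models when the data arrives as a stream.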
For example, a stream residing on a disk may be several gigabytes in length and may be expected to approach terabytes or even larger in future workloads. A run-time system for streaming accelerator attachments may directly pass bytes to stream processing handlers on the accelerator for processing. A task block should be generated once for the entire stream, rather than for each byte or bit requiring accelerator processing.
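By way of contrast, the following is a minimal sketch of a streaming attachment in which a single task block is generated for the entire stream and bytes are passed directly to a stream processing handler. StreamTaskBlock, StreamingAccelerator, open_stream, byte_source, and the running-checksum handler are hypothetical names assumed for illustration only.

```python
from dataclasses import dataclass


@dataclass
class StreamTaskBlock:
    """Hypothetical descriptor generated once for an entire stream."""
    stream_id: int
    handler: str


class StreamingAccelerator:
    """Stand-in for an accelerator that exposes stream processing handlers."""

    def __init__(self):
        self.task_blocks_received = 0
        self.handlers = {"running_checksum": self._running_checksum}

    def open_stream(self, task):
        # The task block is consumed once; it selects the handler for the stream.
        self.task_blocks_received += 1
        return self.handlers[task.handler]

    @staticmethod
    def _running_checksum(chunks):
        # Processes chunks as they arrive; the stream length need not be known.
        total = 0
        for chunk in chunks:
            total = (total + sum(chunk)) % 256
            yield total


def byte_source(n_chunks, chunk_size=4):
    """Illustrative stream source; a real stream could be unbounded."""
    for i in range(n_chunks):
        yield bytes([i % 256] * chunk_size)


accelerator = StreamingAccelerator()
handler = accelerator.open_stream(StreamTaskBlock(stream_id=0, handler="running_checksum"))
stream_results = list(handler(byte_source(3)))
```

Here the count of task blocks stays at one regardless of how many chunks the source produces, because the handler consumes the stream incrementally instead of receiving a descriptor per chunk.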
However, call-return programming models may not be equipped to handle an external stream entering an accelerator connected to a high-end computing system. Because data about such a stream does not exist on the high-end computing system, that data cannot be used in a call to the accelerator for stream-based processing. For example, several large workloads may consist of streams arriving externally at an accelerator and subsequently being forwarded to the high-end computing system for archival, storage, and further processing.
Therefore, with streaming workloads likely to become ubiquitous in the future, a new method to handle accelerators attached to high-end computing systems for stream processing may be prudent.