Field of the Invention
This invention relates generally to the field of computer processors. More particularly, the invention relates to an apparatus and method for a hybrid latency-throughput processor.
Description of the Related Art
Invoking accelerators today requires going through a driver interface. In a system in which a hierarchical protection domain is used, this means switching to ring 0 and copying data to a different address space, which consumes significant time and processing resources. Due to the high latency, such accelerator interfaces are also inherently asynchronous. Programmable accelerators require the accelerated code to be implemented in their own instruction set architecture (ISA).
Some current processor architectures attempt to address some of these concerns but provide only a coarse-grained asynchronous mechanism with a high latency between the accelerated task request and its execution. In addition, current architectures use a non-x86 ISA, which requires a separate toolchain to generate and integrate the accelerated task with the main x86 program.
In addition, current asynchronous hardware accelerators (e.g., GPUs) allow the accelerated task to execute independently of the application thread that triggered it. This allows the application thread to handle exceptions and/or interrupts without affecting the accelerated task, and even allows the application thread to migrate between cores without affecting the accelerated task's location in the system.
Current synchronous hardware accelerators must ensure that interrupts, exceptions, context switches and core migrations are handled in a functionally correct manner and that forward progress is guaranteed. This is done either by (1) ensuring the accelerated operation is short enough and does not cause any exceptions, so that any interrupts are deferred until the accelerator is done; (2) maintaining the accelerator's forward progress in existing architectural registers (e.g., REP MOVS); or (3) defining new architectural registers to hold the accelerator status, and adding them to the state saved and restored by XSAVE/XRSTOR.
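By way of illustration only, approach (2) may be sketched in C as follows. The sketch is a software model, not a description of any actual hardware implementation: the structure `copy_regs`, the functions `copy_step` and `copy_resumable`, and the `budget` parameter that stands in for the arrival of an interrupt are all hypothetical names introduced here. It shows only the general idea that, when all progress is held in architectural state (the source pointer, destination pointer and remaining count, in the manner of a REP MOVS copy), the operation can be interrupted at any point and resumed without redoing or losing work.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the architectural registers that hold the
   progress of a REP MOVS-style copy. */
struct copy_regs {
    const unsigned char *src;  /* plays the role of the source pointer (RSI) */
    unsigned char *dst;        /* plays the role of the destination pointer (RDI) */
    size_t count;              /* plays the role of the remaining count (RCX) */
};

/* Copy at most `budget` bytes, then return as if an interrupt had arrived.
   All progress is recorded in *r, so the caller may save *r, do other
   work, and call again later. Returns the number of bytes still to copy. */
size_t copy_step(struct copy_regs *r, size_t budget)
{
    assert(r != NULL);
    while (r->count > 0 && budget > 0) {
        *r->dst++ = *r->src++;
        r->count--;
        budget--;
    }
    return r->count;
}

/* Drive a copy to completion, resuming after each simulated interrupt.
   Returns the number of times the copy had to be (re)started. */
size_t copy_resumable(unsigned char *dst, const unsigned char *src,
                      size_t n, size_t budget)
{
    struct copy_regs r = { src, dst, n };
    size_t steps = 0;

    assert(budget > 0);  /* a zero budget would never make progress */
    while (r.count > 0) {
        copy_step(&r, budget);
        steps++;
    }
    return steps;
}
```

In this model, saving and restoring the three fields of `copy_regs` is all that a context switch or core migration would need to do in order to preserve the forward progress of the copy, which is why no additional architectural state is required under approach (2).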
In addition, throughput program code is currently developed in specialized programming languages and instruction set architectures (ISAs) (e.g., for DSPs and GPUs). As such, throughput programs must be written in a different ISA, and built with a different toolchain, than latency programs. A single application which has both latency and throughput parts must be split into separate sub-programs. Once separated, each sub-program runs on different hardware, incurring significant overhead in control and data transfer between the two sub-programs. Scheduling of the separate sub-programs over the different hardware resources is done separately by different entities such as the operating system and a driver or middleware.