Traditional Integration of Co-Processors
As semiconductor manufacturing processes are reaching an era that approaches 1 trillion transistors per die, design engineers are presented with the issue of how to most effectively put to use all the available transistors. One design approach is to implement specific computation intensive functions with dedicated hardware “acceleration” on die along with one or more general purpose CPU cores.
Acceleration is achieved with dedicated logic blocks designed to perform specific computation intensive functions. Migrating intensive computations to such dedicated logic blocks frees the CPU core(s) from executing significant numbers of instructions thereby increasing the effectiveness and efficiency of the CPU core(s).
Although “acceleration” in the form of co-processors (such as graphics co-processors) is known in the art, such traditional co-processors are viewed by the OS as a separate “device” (within a larger computing system) that is external to the CPU core(s) that the OS runs on. These co-processors are therefore accessed through special device driver software and do not operate out of the same memory space as a CPU core. As such, traditional co-processors do not share or contemplate the virtual-to-physical address translation scheme implemented on a CPU core.
Moreover, large latencies are encountered when a task is offloaded by an OS to a traditional co-processor. Specifically, as a CPU core and a traditional co-processor essentially correspond to separate, isolated sub-systems, significant communication resources are expended when tasks defined in the main OS on a GPP core are passed to the “kernel” software of the co-processor. Such large latencies favor system designs that invoke relatively infrequent tasks on the co-processor from the main OS but with large associated blocks of data per task. In effect, traditional co-processors are primarily utilized in a coarse grain fashion rather than a fine grain fashion.
As current system designers are interested in introducing more acceleration into computing systems with finer grained usages, a new paradigm for integrating acceleration in computing systems is warranted.
CPUID/XSAVE/XRSTOR Instructions and the XCR0 Register
An Instruction Set Architecture (ISA) currently offered by Intel Corporation supports mechanisms for enabling, externally saving and restoring the state of certain hardware supported “extensions” to the ISA's traditional instruction set. Specifically, according to one implementation, the ISA's floating point instructions (x87), 128 bit vector instructions (SSE) and 256 bit vector instructions with 3 operand instruction format (AVX) are each viewed as separate “extensions” to the ISA's traditional instruction set (x86).
A control register, XCR0, that is internal to the processor can be written to by software to enable any one or more of these extensions. Specifically, the XCR0 register maintains one bit for each of the three extensions (i.e., an x87 bit, an SSE bit and an AVX bit). Software (e.g., the operating system (OS)) is permitted to individually set the various bits to individually enable the x87/SSE/AVX extensions according to its own intentions. The XCR0 register is understood to have additional, currently undefined bit positions, so that additional extensions can be added in the future and enabled/disabled accordingly.
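The per-extension enable bits can be illustrated with a small software model. This is a minimal sketch, not processor code: the helper names (`xcr0_enable`, `xcr0_disable`, `xcr0_is_enabled`) are hypothetical, and the model only manipulates an ordinary 64 bit integer standing in for XCR0. The bit positions used (x87 = bit 0, SSE = bit 1, AVX = bit 2) follow the published layout of the register; on real hardware the register is written with a privileged instruction rather than by plain assignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of the three extensions within XCR0. */
#define XCR0_X87   (UINT64_C(1) << 0)
#define XCR0_SSE   (UINT64_C(1) << 1)
#define XCR0_AVX   (UINT64_C(1) << 2)
/* Bits this sketch knows about; all higher bits are treated as undefined. */
#define XCR0_KNOWN (XCR0_X87 | XCR0_SSE | XCR0_AVX)

/* Hypothetical helper: return the modeled XCR0 value with 'features' enabled. */
static uint64_t xcr0_enable(uint64_t xcr0, uint64_t features)
{
    assert((features & ~XCR0_KNOWN) == 0);  /* reject undefined bit positions */
    return xcr0 | features;
}

/* Hypothetical helper: return the modeled XCR0 value with 'features' disabled. */
static uint64_t xcr0_disable(uint64_t xcr0, uint64_t features)
{
    assert((features & ~XCR0_KNOWN) == 0);
    return xcr0 & ~features;
}

/* Hypothetical helper: true if every bit in 'features' is set in 'xcr0'. */
static bool xcr0_is_enabled(uint64_t xcr0, uint64_t features)
{
    return (xcr0 & features) == features;
}
```

In this model, enabling x87 and SSE but not AVX amounts to `xcr0_enable(0, XCR0_X87 | XCR0_SSE)`, after which `xcr0_is_enabled(value, XCR0_AVX)` reports false.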
A CPUID instruction has been implemented in the ISA that the software can use to determine how much memory space is needed to externally store the state information of the enabled extensions. For example, with various input operand values, the CPUID instruction can be executed by the software to determine: i) the total amount of memory space needed to store all the state information of all the enabled extensions; and, ii) the total amount of memory space needed to store all the state information of any particular one of the enabled extensions. Thus, for example, if the x87 and SSE extensions are enabled, the CPUID instruction can be used to determine: i) the total amount of memory space needed to store all the state information of the x87 and SSE extensions; ii) the total amount of memory space needed to store all the state information of just the x87 extension; and, iii) the total amount of memory space needed to store all the state information of just the SSE extension.
Here, as the state information for an extension largely corresponds to the information stored in the extension's associated data registers (i.e., the floating point registers for the x87 extension, the 128 bit registers for the SSE extension, the 256 bit registers for the AVX extension), the CPU hardware knows “how large” the register space is for each of its extensions and can readily provide/return such information as a result of the CPUID instruction.
As such, in a typical case, software will execute the CPUID instruction to understand how much memory space needs to be allocated for the state information of the various extensions it has enabled, then, proceed to allocate such memory space.
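The query-then-allocate sequence described above can be sketched with a software stand-in for the size query. Everything here is an assumption made for illustration: the function names are hypothetical, and the per-extension sizes are rough register-file estimates (register count times register width), not the values a real processor reports. On real hardware the sizes would come from executing CPUID and reading its output registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define EXT_X87 (UINT64_C(1) << 0)
#define EXT_SSE (UINT64_C(1) << 1)
#define EXT_AVX (UINT64_C(1) << 2)

/* Stand-in for the "size of one extension" query.
   The sizes are placeholders for this sketch, not real CPUID results. */
static size_t model_size_one(uint64_t ext)
{
    switch (ext) {
    case EXT_X87: return 8 * 10;   /* assumed: 8 registers of 80 bits   */
    case EXT_SSE: return 16 * 16;  /* assumed: 16 registers of 128 bits */
    case EXT_AVX: return 16 * 32;  /* assumed: 16 registers of 256 bits */
    default:      return 0;        /* unknown or undefined extension    */
    }
}

/* Stand-in for the "total size of all enabled extensions" query. */
static size_t model_size_total(uint64_t enabled)
{
    size_t total = 0;
    for (int i = 0; i < 3; i++) {
        uint64_t bit = UINT64_C(1) << i;
        if (enabled & bit)
            total += model_size_one(bit);
    }
    return total;
}

/* Query the total size, then allocate zeroed memory for the state area.
   The caller owns the returned buffer and releases it with free(). */
static void *alloc_state_area(uint64_t enabled)
{
    size_t bytes = model_size_total(enabled);
    assert(bytes > 0);  /* at least one known extension must be enabled */
    return calloc(1, bytes);
}
```

With x87 and SSE enabled, this model reports 80 bytes for x87 alone, 256 bytes for SSE alone and 336 bytes for both together, mirroring the three queries in the example above.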
The XSAVE instruction is called by software to externally save the state information of any/all enabled extensions. Here, the memory address where the state information is to be saved is provided as an input value to the instruction and the processor core causes the state information of the extension(s) to be written to system memory at that address. Less than all of the enabled extensions may have their state information saved on any particular execution of the XSAVE instruction. A mask register, whose bit positions essentially correspond to those of the XCR0 register, is used by the executing XSAVE instruction to selectively specify which enabled extensions are to have their state information stored. The externally stored information also includes an XSTATE_BV vector field that corresponds to the mask register information. That is, the XSTATE_BV vector field indicates which of the extensions have had their state information externally stored in memory.
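The selective save described above can be modeled in software as follows. This is a simplified sketch under stated assumptions: the types `model_cpu` and `model_save_area`, the function `model_xsave` and the fixed 64 byte per-extension size are all hypothetical, and the layout of the modeled save area does not match the real in-memory format. The sketch only shows how the mask and XCR0 together select what is written and how XSTATE_BV records it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXT_X87   (UINT64_C(1) << 0)
#define EXT_SSE   (UINT64_C(1) << 1)
#define EXT_AVX   (UINT64_C(1) << 2)
#define EXT_COUNT 3
#define EXT_BYTES 64  /* placeholder state size per extension (assumption) */

/* Modeled processor: XCR0 plus one opaque state blob per extension. */
typedef struct {
    uint64_t xcr0;
    uint8_t  state[EXT_COUNT][EXT_BYTES];
} model_cpu;

/* Modeled save area in memory: XSTATE_BV plus one slot per extension. */
typedef struct {
    uint64_t xstate_bv;
    uint8_t  slot[EXT_COUNT][EXT_BYTES];
} model_save_area;

/* Write to 'area' the state of every extension that is both enabled in
   XCR0 and selected by 'mask', and mark each one in XSTATE_BV.
   Returns the bitmap of extensions actually written. */
static uint64_t model_xsave(const model_cpu *cpu, uint64_t mask,
                            model_save_area *area)
{
    assert(cpu != NULL && area != NULL);
    uint64_t requested = cpu->xcr0 & mask;  /* disabled extensions are skipped */
    for (int i = 0; i < EXT_COUNT; i++) {
        uint64_t bit = UINT64_C(1) << i;
        if (requested & bit) {
            memcpy(area->slot[i], cpu->state[i], EXT_BYTES);
            area->xstate_bv |= bit;
        }
    }
    return requested;
}
```

For example, with x87 and SSE enabled in the modeled XCR0 and a mask of `EXT_SSE | EXT_AVX`, only the SSE slot is written: AVX is selected but not enabled, and x87 is enabled but not selected.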
The XRSTOR instruction corresponds to the logical opposite of the XSAVE instruction. In the case of the XRSTOR instruction, an input value to the instruction specifies where the state information for the extension(s) is stored in system memory. Execution of the instruction causes the processor core to read the state information from memory at that address and load the state information into the appropriate extension register space. As part of the loading process, the processor first reads the contents of the XSTATE_BV vector field stored in memory to understand which extensions have had their state information stored in memory. The processor then loads into itself from memory the state information of those extensions that have had their state information externally stored in memory as indicated in the XSTATE_BV vector field. Ideally, the XRSTOR instruction is provided a mask vector that matches the contents of the XSTATE_BV vector read from memory and whose set bits correspond to enabled extensions in the XCR0 register.
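The restore path can be modeled in the same simplified way. As before, `model_cpu`, `model_save_area`, `model_xrstor` and the 64 byte per-extension size are hypothetical names and values chosen for illustration, and the modeled save area does not follow the real memory layout. The sketch loads only those extensions that are enabled in XCR0, selected by the mask and marked in XSTATE_BV; it leaves out what happens to an extension that is selected but not marked in XSTATE_BV.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXT_X87   (UINT64_C(1) << 0)
#define EXT_SSE   (UINT64_C(1) << 1)
#define EXT_AVX   (UINT64_C(1) << 2)
#define EXT_COUNT 3
#define EXT_BYTES 64  /* placeholder state size per extension (assumption) */

/* Modeled processor: XCR0 plus one opaque state blob per extension. */
typedef struct {
    uint64_t xcr0;
    uint8_t  state[EXT_COUNT][EXT_BYTES];
} model_cpu;

/* Modeled save area in memory: XSTATE_BV plus one slot per extension. */
typedef struct {
    uint64_t xstate_bv;
    uint8_t  slot[EXT_COUNT][EXT_BYTES];
} model_save_area;

/* Read XSTATE_BV first, then load each extension that is enabled in XCR0,
   selected by 'mask' and recorded as saved in XSTATE_BV.
   Returns the bitmap of extensions actually loaded. */
static uint64_t model_xrstor(model_cpu *cpu, uint64_t mask,
                             const model_save_area *area)
{
    assert(cpu != NULL && area != NULL);
    uint64_t to_load = cpu->xcr0 & mask & area->xstate_bv;
    for (int i = 0; i < EXT_COUNT; i++) {
        uint64_t bit = UINT64_C(1) << i;
        if (to_load & bit)
            memcpy(cpu->state[i], area->slot[i], EXT_BYTES);
    }
    return to_load;
}
```

In the ideal case described above, the mask equals the stored XSTATE_BV and every bit in it is enabled in XCR0, so the returned bitmap equals the mask and every saved extension is loaded.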