Field of the Invention
Embodiments of the present invention relate generally to graphics processing and, more specifically, to improved efficiency in a distributed instruction set architecture.
Description of the Related Art
In computer systems in general, and in graphics processing units (GPUs) in particular, a number of different instructions are typically required, each directing a particular operation. In each generation of computer or GPU development, the instruction set architecture determines which instructions are available for execution and the manner in which each instruction or instruction group is performed. The instruction set architecture is implemented by a collection of functional blocks that can range from, at one extreme, a wide array of elements, each optimized to perform a single instruction, to, at the other extreme, a small array of elements, each designed to perform multiple instructions. With many specialized elements, performance can be maximized, because each instruction can be performed optimally at any time, but at the expense of increased power overhead, due to the large number of specialized elements that are left idle, and of the large area required to implement them. With a small number of more generalized elements, power efficiency can be improved, due to the reduced area and the reduced number of idle elements, but at the expense of performance: no generalized element can be optimized for all instructions, and delays can occur due to scheduling conflicts.
There is commonly a dominant instruction, that is, an instruction that must be performed much more frequently than other instructions. Typically, this instruction is a fused floating-point multiply-add (FFMA) instruction. Allocating the number of elements that perform the FFMA instruction (FFMA elements), and determining whether these elements may be expanded with additional functionality to perform other instructions, is a significant aspect of implementing the instruction set architecture.
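By way of illustration only, the arithmetic performed by a fused multiply-add may be sketched as follows. The sketch assumes the conventional meaning of "fused," namely that a*b + c is computed with a single rounding step rather than one rounding after the multiply and another after the add. The function names and operand values are hypothetical, double-precision Python floats are used for convenience, and exact rational arithmetic stands in for the hardware; none of this is taken from any particular GPU implementation.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a*b + c with a single rounding (finite inputs only)."""
    # Each float converts to an exact Fraction, so the product and the sum
    # are exact; the conversion back to float is the only rounding step.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Compute a*b + c as two separate operations, each rounded."""
    product = a * b      # first rounding
    return product + c   # second rounding


# Hypothetical operands chosen so that the exact product, 1 - 2**-60,
# is not representable in double precision and rounds to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

fused = fused_multiply_add(a, b, c)      # keeps the low-order term: -2**-60
unfused = unfused_multiply_add(a, b, c)  # low-order term is lost: 0.0

print("fused:  ", fused)
print("unfused:", unfused)
```

In this sketch the fused result retains the low-order term that the separately rounded multiply discards, which is one reason a single fused operation is commonly preferred over a multiply followed by an add.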
One drawback of the above approaches is that the generally negative correlation between processing performance and power efficiency affords no opportunity to realize improved processing performance and more efficient power usage simultaneously. Further, less commonly performed instructions impose a burden on the system design that is nearly equal to that of the most commonly required instructions, such as the FFMA instruction.
As the foregoing illustrates, what is needed in the art is a more efficient technique for implementing an instruction set architecture.