The present invention relates in general to graphics processors and in particular to a double-precision fused multiply-add functional unit for a graphics processor.
Graphics processors are commonly used in computer systems to accelerate rendering of images from two-dimensional or three-dimensional geometry data. Such processors are typically designed with a high degree of parallelism and high throughput, allowing thousands of primitives to be processed in parallel to render complex, realistic animated images in real time. High-end graphics processors provide more computing power than typical central processing units (CPUs).
More recently, there has been interest in leveraging the power of graphics processors to accelerate various computations unrelated to image rendering. A “general-purpose” graphics processor can be used to perform computations in scientific, financial, business and other fields.
One difficulty in adapting graphics processors for general-purpose computations is that graphics processors are usually designed for relatively low numerical precision. High-quality images can be rendered using 32-bit (“single-precision”) or even 16-bit (“half-precision”) floating-point values, and the functional units and internal pipelines are configured to support these data widths. In contrast, many general-purpose computations require higher numerical precision, e.g., 64 bits (“double-precision”).
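The practical effect of the narrower data width can be illustrated with a minimal sketch. It assumes IEEE 754 binary32 and binary64 formats with round-to-nearest-even, and the helper names (to_f32, count_up_f32, count_up_f64) are hypothetical and introduced only for illustration. A single-precision value carries a 24-bit significand, so a running count held in single precision stops advancing once it reaches 2^24, whereas the same count held in double precision continues to increase.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def count_up_f32(start: float, steps: int) -> float:
    """Add 1.0 repeatedly, rounding the running total to single precision."""
    total = to_f32(start)
    for _ in range(steps):
        # The exact sum is formed in binary64, then rounded to binary32,
        # which matches a correctly rounded single-precision addition.
        total = to_f32(total + 1.0)
    return total


def count_up_f64(start: float, steps: int) -> float:
    """Add 1.0 repeatedly in double precision (Python's native float)."""
    total = start
    for _ in range(steps):
        total += 1.0
    return total


if __name__ == "__main__":
    start = float(2 ** 24)          # 16777216.0, the edge of the 24-bit significand
    print(count_up_f32(start, 10))  # 16777216.0 (each +1.0 is rounded away)
    print(count_up_f64(start, 10))  # 16777226.0
```

Errors of this kind are generally not visible in a rendered image, but they can accumulate in long-running numerical computations, which is one reason such computations often call for double precision.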
To support higher precision, some graphics processors use software techniques to execute double-precision computations using a sequence of machine instructions and 32-bit or 16-bit functional units. This approach is slow; for instance, a hundred or more machine instructions might be required to complete a single 64-bit multiplication operation, and such long sequences significantly reduce the double-precision throughput of the graphics processor. In one representative case, it is estimated that the graphics processor would complete double-precision computations at about ⅕ the throughput possible with a high-end dual-core CPU chip. (By comparison, the same graphics processor can complete single-precision computations at about 15-20 times the throughput of the dual-core CPU.) Because software-based solutions are so much slower, existing graphics processors are rarely used for double-precision computations.
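One reason a software sequence becomes long can be seen in the significand product alone. The following sketch is illustrative only and does not reproduce the instruction sequence of any particular graphics processor. It assumes a hypothetical primitive, mul32, that multiplies two 32-bit unsigned values and returns the high and low 32-bit halves of the result, and it builds the product of two wider operands (such as 53-bit double-precision significands) from four such multiplications plus explicit carry handling. The names mul32 and mul_wide are assumptions introduced for this example.

```python
MASK32 = 0xFFFFFFFF


def mul32(a: int, b: int) -> tuple[int, int]:
    """Hypothetical 32x32-bit multiply primitive returning (high, low) 32-bit halves."""
    assert 0 <= a <= MASK32 and 0 <= b <= MASK32
    p = a * b
    return p >> 32, p & MASK32


def mul_wide(a: int, b: int) -> int:
    """Multiply two unsigned operands of up to 64 bits using only 32-bit pieces."""
    a_hi, a_lo = a >> 32, a & MASK32
    b_hi, b_lo = b >> 32, b & MASK32

    # Four narrow multiplications, one per pair of 32-bit limbs.
    ll_hi, ll_lo = mul32(a_lo, b_lo)
    lh_hi, lh_lo = mul32(a_lo, b_hi)
    hl_hi, hl_lo = mul32(a_hi, b_lo)
    hh_hi, hh_lo = mul32(a_hi, b_hi)

    # Column sums with explicit carries, as a narrow datapath would need.
    r0 = ll_lo
    s1 = ll_hi + lh_lo + hl_lo
    r1, c1 = s1 & MASK32, s1 >> 32
    s2 = lh_hi + hl_hi + hh_lo + c1
    r2, c2 = s2 & MASK32, s2 >> 32
    r3 = (hh_hi + c2) & MASK32

    return (r3 << 96) | (r2 << 64) | (r1 << 32) | r0


if __name__ == "__main__":
    # Two 53-bit values standing in for double-precision significands
    # (implicit leading 1 followed by 52 fraction bits).
    sig_a = (1 << 52) | 0x123456789ABCD
    sig_b = (1 << 53) - 1
    assert mul_wide(sig_a, sig_b) == sig_a * sig_b
```

This sketch covers only the significand product. A complete double-precision multiplication would additionally have to unpack the operands, add the exponents, normalize and round the result, and handle special values, and each of those steps adds further instructions to the sequence.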
Another solution is simply to make all of the arithmetic circuits of the graphics processor wide enough to handle double-precision operands. This would increase the graphics processor's throughput for double-precision operations to match the single-precision throughput. However, graphics processors typically have dozens of copies of each arithmetic circuit to support parallel operations, and increasing the size of each such circuit would substantially increase the chip area, cost and power consumption.
Still another solution, as described in commonly-owned co-pending U.S. patent application Ser. No. 11/359,353, filed Feb. 21, 2006, is to leverage single-precision arithmetic circuits to perform double-precision operations. In this approach, special hardware included in a single-precision functional unit is used to perform a double-precision operation iteratively. This approach is considerably faster than software-based solutions (throughput might be reduced, e.g., by a factor of 4 relative to single-precision throughput, rather than by a factor of ~100), but it can significantly complicate the chip design. In addition, sharing the same functional unit between single-precision and double-precision operations can result in that unit becoming a bottleneck in the pipeline if too many instructions require the same functional unit.