Since the year 2000, fixed function Graphics Processing Units (GPUs) are becoming more and more programmable, providing a user with direct and flexible control on the processing primitive, vertex, texture, and pixel streams in graphics chips. Many current GPUs can feature programmability in the form of at least one shader (primitive, vertex, etc.) but generally can process only a few types of data (say 32-bit floating point for vertex and 32-bit integer). The programmable shaders in the graphics pipeline are generally arranged in sequential manner for forwarding data to fixed function units and to each other with a data format conversion if desired.
Also generally involved in the design of GPUs are parallel multiprocessor architecture principles. Application of parallel architecture principles generally utilizes a plurality of same type arithmetic logic units (ALUs) to process different types of stream data in non-uniform program threads. In many circumstances, the ALUs are desired to process different kinds of data for every clock cycle if non-uniform program threads are interleaved.
One of important issues is an implementation of complex mathematical functions (special functions) in such multiprocessor structures. There are generally two ways to implement them: special subroutine executed on general ALU and special hardware unit attached to general ALU which produced result by its request. Software implementation of such functions creates significant performance degradation, which might be unacceptable in case of real-time graphics applications. In the case of multiple ALU combined in SIMD structure such unit should be attached to every ALU which may significantly increase hardware overhead. Such complex functions are not used very often in a shader program and most of the time those special hardware units combined with each general ALU will be idling.
This situation can be partially resolved by sharing the special function unit (SFU) among a plurality of ALUs, but in the case of an SIMD structure, a thread will be stalled until all streams will get their result from shared SFU which will process requests sequentially. It may take several cycles of overhead in each involvement of complex mathematical function in shader program. Special arrangements in the SIMD stream architecture should be made to minimize stall wait cycles and provide smooth stream processing with minimal overhead if non-uniform program threads are interleaved.
While the ALUs used in this multiprocessing manner generally sustain high throughput, the ALUs should be able to process more data streams in short format sharing the same hardware for longer format. Generally speaking, current ALUs for GPUs are configured to process only one format of floating point unit (e.g., 32-bit IEEE format as standard) and generally experience low performance in processing lower accuracy pixel and texture data. Additionally, if another type of data format is supported, the ALU generally works with the same number of streams with little to no throughput improvement nor Single Instruction Multiple Data (SIMD) factor variability regardless of the data format. Further, current ALUs are generally not configured to arbitrarily interleave the flow of instructions (lack of support for non-uniform threads). Additionally, current dual format Multiply Accumulate (MACC) units can generally process only integer data.
Vector machines with a fixed data format and a fixed SIMD factor generally have less of a hardware load and generally process stream data relatively slowly in the case where there are a lesser number of elements in the vector stream than the width of a vector unit. Additionally current graphics shader architecture generally has limited instruction set capabilities in processing different format data in the same instruction.
Thus, a heretofore unaddressed need exists in the industry to address the aforementioned deficiencies and inadequacies.