The present invention relates to methods and apparatus for controlling a single instruction, multiple data (SIMD) processing pipeline.
In recent years, there has been an insatiable demand for higher data throughput in computer processing because cutting-edge computer applications involve real-time, multimedia functionality. Graphics applications are among those that place the highest demands on a processing system because they require vast numbers of data accesses, data computations, and data manipulations in relatively short periods of time to achieve desirable visual results. These applications require extremely high data throughput, on the order of many thousands of megabits per second. While some processing systems employ a single processor to achieve fast processing speeds, others are implemented utilizing multi-processor architectures. In multi-processor systems, a plurality of sub-processors can operate in parallel (or at least in concert) to achieve desired processing results.
In a deep-pipeline SIMD processor subject to data paths of varying latency, the coexistence of scalar and vector (SIMD) operations may complicate data dependency checking. The SIMD processor may carry out many operations and/or instructions, each with its own, and potentially different, latency. For example, the Intel IA-32 SSE instruction set employs different instructions for scalar and SIMD computations/operations. Scalar operations use the same registers as SIMD operations but always operate on the same slice of the register. If the words in the unused slices of a destination register have to remain unchanged, the complexity of proper pipeline operation and data forwarding is greatly increased.
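The difference between a vector write and a scalar write to the same register can be illustrated with a minimal sketch. The four-slice register width, the choice of slice 0 as the scalar slice, and the function names `vector_add` and `scalar_add` are assumptions made for illustration only and are not taken from any particular instruction set.

```python
# Hypothetical model: a register is a list of SLICES words.
SLICES = 4          # assumed register width, in slices
SCALAR_SLICE = 0    # assumed slice used by every scalar operation


def vector_add(dst, a, b):
    """SIMD add: every slice of the destination is overwritten."""
    return [a[i] + b[i] for i in range(SLICES)]


def scalar_add(dst, a, b):
    """Scalar add: only the scalar slice is written.

    The unused slices must keep the old destination values, so the
    result depends on the previous contents of dst.
    """
    out = list(dst)
    out[SCALAR_SLICE] = a[SCALAR_SLICE] + b[SCALAR_SLICE]
    return out


r_dst = [10, 20, 30, 40]
r_a = [1, 2, 3, 4]
r_b = [5, 6, 7, 8]

vec_result = vector_add(r_dst, r_a, r_b)
sca_result = scalar_add(r_dst, r_a, r_b)
```

In this sketch the scalar result carries three words from the old destination value, which is why the old contents of the destination register become relevant to pipeline control.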
Further, a SIMD processor includes a plurality of stages, where each stage may perform its operation at the same time and seek to dispose the result thereof in a destination register. Data dependency checking becomes more complex when two or more operations in the pipeline have the same destination register with different unused slices. Stall conditions may be exacerbated when a write-after-write (WAW) dependency or a read-after-write (RAW) dependency is encountered. RAW dependency is particularly problematic because each slice may have a different dependency.
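A small sketch may help show why RAW dependency differs from slice to slice. It assumes a hypothetical pipeline model in which each in-flight operation carries a destination register number and a bit mask of the slices it writes; the `InFlight` record, the operation names, and the four-slice width are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class InFlight:
    name: str
    dest: int         # destination register number
    write_mask: int   # bit i set => slice i is written
    cycles_left: int  # cycles until write-back (hypothetical latency)


def raw_producers(pipeline, src_reg, slices=4):
    """For each slice of src_reg, name the youngest in-flight writer.

    pipeline is ordered oldest to youngest, so a later match replaces
    an earlier one. None means the slice has no pending writer.
    """
    producers = [None] * slices
    for op in pipeline:
        if op.dest != src_reg:
            continue
        for s in range(slices):
            if op.write_mask & (1 << s):
                producers[s] = op.name
    return producers


pipeline = [
    InFlight("vmul", dest=3, write_mask=0b1111, cycles_left=2),
    InFlight("sadd", dest=3, write_mask=0b0001, cycles_left=5),
]

producers = raw_producers(pipeline, src_reg=3)
no_producers = raw_producers(pipeline, src_reg=7)
```

Here a reader of register 3 depends on the scalar operation for slice 0 and on the vector operation for the remaining slices, each with a different remaining latency.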
The complexity associated with dependency checking of unused slices may be addressed by reading the destination register (operand) of an instruction to be issued, along with its source registers, and pipelining the data of unused slices without modification. Unfortunately, this requires additional hardware for reading the destination register and may result in an increase in stalling. This is so because the destination operand (as well as the source operands) may have RAW dependencies. Still further, when the data of unused slices are pipelined without any modification, power savings are difficult to achieve.
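The extra stalling caused by reading the destination register can be sketched as follows. The sketch assumes a simplified issue check with no data forwarding, ignores WAW hazards, and uses hypothetical helper names (`registers_to_check`, `must_stall`); it illustrates the reasoning only and does not describe any specific hardware.

```python
def registers_to_check(dest, sources, is_scalar):
    """Registers that must be free of pending writes before issue.

    Under the read-destination approach, a scalar operation also reads
    its destination so that the unused slices can be carried along.
    """
    regs = list(sources)
    if is_scalar and dest not in regs:
        regs.append(dest)
    return regs


def must_stall(dest, sources, is_scalar, pending_writes):
    """True if any register to be read still has a pending write (RAW)."""
    return any(r in pending_writes
               for r in registers_to_check(dest, sources, is_scalar))


pending_writes = {3}  # register 3 has an in-flight writer (assumed)

scalar_stalls = must_stall(dest=3, sources=[1, 2], is_scalar=True,
                           pending_writes=pending_writes)
vector_stalls = must_stall(dest=3, sources=[1, 2], is_scalar=False,
                           pending_writes=pending_writes)
```

With the same sources and destination, only the scalar operation stalls in this model, because its destination has become an additional read operand.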
Another way in which the complexity associated with dependency checking of unused slices may be addressed is by delaying or stalling the issuance of the instruction for a sufficient time for most operations (with the same destination register) in the pipeline to finish their write-back stage. Unfortunately, this may cause significant performance degradation.
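The cost of the stalling approach can be estimated with a short sketch. As a simplification, it assumes the issue logic waits for every in-flight operation with the same destination register to reach write-back; the tuple layout and the name `stall_cycles` are hypothetical.

```python
def stall_cycles(in_flight, dest):
    """Cycles to wait before issuing an instruction that writes dest.

    in_flight is a list of (dest_reg, cycles_until_write_back) pairs.
    The wait is set by the slowest pending writer of the same register.
    """
    return max((c for d, c in in_flight if d == dest), default=0)


in_flight = [(3, 2), (3, 5), (4, 7)]

stall_for_r3 = stall_cycles(in_flight, 3)
stall_for_r9 = stall_cycles(in_flight, 9)
```

In a deep pipeline with long-latency operations, the wait is governed by the slowest pending writer, which suggests why this approach can degrade performance noticeably.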