Pipelined computing or processing architectures are well known, and such pipelined architectures vary in depth (i.e., the number of pipeline stages). Many pipelined architectures include five basic pipeline stages: (1) fetch, (2) decode, (3) execute, (4) memory access, and (5) writeback. The general operation of these stages is well known.
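By way of illustration only, the overlap among the five basic stages may be sketched in software as follows. The sketch assumes an idealized pipeline with no stalls or hazards, in which each instruction advances one stage per clock cycle; the function name `pipeline_schedule` and the stage labels are hypothetical and are not taken from any particular architecture.

```python
# Hypothetical stage labels for the five basic pipeline stages.
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return one dict per clock cycle mapping stage name -> instruction index.

    A stage that holds no instruction in a given cycle maps to None.
    Assumes an ideal pipeline: no stalls, no hazards, one stage per cycle.
    """
    depth = len(stages)
    # The last instruction enters at cycle (num_instructions - 1) and needs
    # `depth` cycles to drain, so the total is num_instructions + depth - 1.
    total_cycles = num_instructions + depth - 1 if num_instructions > 0 else 0
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for stage_index, name in enumerate(stages):
            # Instruction i occupies stage s during cycle i + s.
            instr = cycle - stage_index
            row[name] = instr if 0 <= instr < num_instructions else None
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_schedule(3)):
        print(cycle, row)
```

Under these assumptions, the number of cycles needed grows with the number of instructions plus the pipeline depth, rather than with their product, which is the basic benefit of pipelining.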
Reference is made to FIG. 1 showing a portion of such a basic pipelined architecture. Specifically, shown in FIG. 1 is a register file 12 and an arithmetic logic unit (ALU) 14. Typically, the execute stage of a pipelined architecture includes one or more processing units (such as an ALU) for carrying out processing operations associated with an instruction. The ALU 14 of FIG. 1 includes various dashed lines to represent multiple cycles of operation (e.g., clock cycles).
With regard to the register file 12, as is known, data is retrieved from system memory into a “register file,” which is an area of high-speed memory, configured in the form of registers. Once data is in the register file 12, it typically can be retrieved by any of the pipeline stages (e.g., fetch, execute, etc.) within a single clock cycle. The register file 12 has also been depicted near the bottom of FIG. 1 (in dashed line) to denote the writeback communication of data from the execute stage (or ALU 14) to the register file 12. To simplify the illustration, other pipeline stages have not been depicted.
As is known, to improve the efficiency of multi-dimensional computations, Single-Instruction, Multiple-Data (SIMD) architectures have been developed. A typical SIMD architecture enables one instruction to operate on several operands simultaneously. In particular, SIMD architectures may take advantage of packing several data elements into one register or memory location. With parallel hardware execution, multiple operations can be performed with one instruction, resulting in significant performance improvement and simplification of hardware through reduction in program size and control. Some SIMD architectures perform operations in which the corresponding elements in separate operands are operated upon in parallel and independently.
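As a non-limiting sketch of the packing described above, the following models four 8-bit data elements packed into one 32-bit register word, with a single add operation applied to corresponding elements independently. The lane count, lane width, wraparound behavior on overflow, and the names `pack`, `unpack`, and `simd_add` are assumptions chosen for illustration; actual SIMD hardware performs the lane operations in parallel rather than in a software loop.

```python
LANES = 4                            # assumed number of packed elements
LANE_BITS = 8                        # assumed width of each element in bits
LANE_MASK = (1 << LANE_BITS) - 1     # 0xFF


def pack(elements):
    """Pack LANES small integers into one register-sized word (lane 0 lowest)."""
    if len(elements) != LANES:
        raise ValueError("expected %d elements" % LANES)
    word = 0
    for i, element in enumerate(elements):
        word |= (element & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a packed word back into its LANES elements."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def simd_add(a, b):
    """Add corresponding lanes of two packed words.

    Each lane is computed independently; a carry out of one lane is
    discarded (wraparound) and never propagates into the next lane.
    """
    result = 0
    for i in range(LANES):
        shift = i * LANE_BITS
        lane_sum = ((a >> shift) + (b >> shift)) & LANE_MASK
        result |= lane_sum << shift
    return result


if __name__ == "__main__":
    x = pack([1, 2, 3, 250])
    y = pack([10, 20, 30, 10])
    print(unpack(simd_add(x, y)))
```

In this sketch one call to `simd_add` stands in for one SIMD instruction: a single operation specification yields four element results.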
Reference is now made to FIG. 2, which is a diagram illustrating an architecture similar to that of FIG. 1, but depicting a plurality of ALUs 16, 18, 20, and 22. Such an architecture is efficient in many SIMD applications. For efficient operation in such an architecture, data is organized in the register file 12 such that operands (or other associated data) can be readily loaded (in parallel) into the various ALUs in the same clock cycle.
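The data organization described above may be sketched, purely as an illustrative model, with each register holding one element per ALU so that a single instruction supplies every ALU with its operands in the same step. The dictionary-based register file, the register names, and the function `execute_cycle` are hypothetical; the sequential loop merely stands in for the four ALUs operating in parallel.

```python
import operator

NUM_ALUS = 4  # one lane per ALU, corresponding to the four ALUs of FIG. 2


def execute_cycle(register_file, op, src_a, src_b, dst):
    """Model one execute step across NUM_ALUS parallel ALUs.

    register_file maps a register name to a list of NUM_ALUS elements.
    ALU number `lane` reads element `lane` of both source registers, so all
    operands for the step are available from two register reads.
    """
    a_row = register_file[src_a]
    b_row = register_file[src_b]
    if len(a_row) != NUM_ALUS or len(b_row) != NUM_ALUS:
        raise ValueError("each register must hold %d elements" % NUM_ALUS)
    # Each list entry represents the output of one ALU.
    results = [op(a_row[lane], b_row[lane]) for lane in range(NUM_ALUS)]
    # Writeback of all lane results to the destination register.
    register_file[dst] = results
    return results


if __name__ == "__main__":
    rf = {"r0": [1, 2, 3, 4], "r1": [10, 20, 30, 40], "r2": [0, 0, 0, 0]}
    execute_cycle(rf, operator.mul, "r0", "r1", "r2")
    print(rf["r2"])
```

If the operands were instead scattered across registers in some other arrangement, additional steps would be needed to gather them before the ALUs could operate together, which is why the organization of the register file matters in such an architecture.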
Notwithstanding the improved efficiency realized by the architecture of FIG. 2, further improvements to this architecture are desired.