A number of modern processor architectures, including superscalar processors and very-long-instruction-word (VLIW) processors, exploit instruction-level parallelism within a processing task by executing multiple instructions concurrently, thus multiplying execution throughput relative to conventional scalar (sequential-execution) processors. Superscalar processors typically include dedicated hardware to dynamically determine data dependencies and schedule parallel execution accordingly. By contrast, VLIW processors generally rely on static, compile-time determination of such data dependencies and corresponding instruction parallelism, and thus trade increased compiler complexity for architectural simplicity and efficiency. VLIW architectures are particularly well suited to processing the high-bandwidth streams of data that pervade modern computing and consumer electronics applications.
FIG. 1A illustrates a prior-art VLIW processor architecture 100, including multiple function units 1011-1013 coupled to a shared operand register file 103 (ORF). As shown in FIG. 1B, a VLIW instruction 115 typically specifies explicitly the operation (i.e., in operation codes op1, op2, op3) to be performed in parallel by each function unit 101, as well as the register-file addresses of the operands (radr0-radr2) and of the operation outputs (wadr0, wadr1). This is in contrast to superscalar designs, in which allocation of operations to individual function units is determined during program run-time and generally hidden from compiler/programmer view. The set of register-file addresses and operation code provided to each function unit is referred to as a slot or instruction slot within the overall VLIW instruction. In the VLIW instruction 115, for example, instruction slots 1161-1163 are provided to function units 1011-1013, respectively, during a given execution cycle (also referred to herein as an operation cycle). In each execution cycle, each function unit 101 receives a set of operands 102 from the shared register file 103, performs the instructed operation thereon, and outputs operation results 104 (although result output may be latent by a number of execution cycles as shown in FIG. 1B, according to the depth of the execution pipeline), all in accordance with the read addresses, operation code and write addresses provided in the instruction slot for that function unit. To accommodate this parallel operation, the shared operand register file 103 must generally be multi-ported, having a separate access port for each operand to be retrieved and each result to be written back.
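By way of illustration only, the slot-based instruction format described above may be modeled in software as follows. This is a minimal sketch under stated assumptions: the names `Slot`, `OPS` and `execute_vliw` are hypothetical, each slot is simplified to two read addresses and one write address, and pipeline latency is ignored so that results appear in the same cycle.

```python
from dataclasses import dataclass
import operator

# Hypothetical operation codes; the text does not name specific operations.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


@dataclass(frozen=True)
class Slot:
    """One instruction slot: an operation code plus register-file addresses."""
    op: str      # operation code for this function unit
    radr0: int   # read address of the first operand
    radr1: int   # read address of the second operand
    wadr0: int   # write address of the result


def execute_vliw(regs, instruction):
    """Execute every slot of one VLIW instruction in a single cycle.

    All operands are read from the shared register file before any result
    is written back, which models the slots operating in parallel.
    """
    operands = [(regs[s.radr0], regs[s.radr1]) for s in instruction]
    results = [OPS[s.op](a, b) for s, (a, b) in zip(instruction, operands)]
    updated = list(regs)
    for slot, result in zip(instruction, results):
        updated[slot.wadr0] = result
    return updated


# A three-slot instruction, one slot per function unit.
regs = [0, 1, 2, 3, 4, 5, 6, 7]
instr = [Slot("add", 1, 2, 0), Slot("mul", 3, 4, 1), Slot("sub", 6, 5, 2)]
after = execute_vliw(regs, instr)

# Each slot needs two read ports and one write port on the shared file.
ports_needed = len(instr) * 3
```

In this sketch the three slots together require nine simultaneous register-file accesses per cycle, which is the multi-porting requirement noted above.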
In practice, the need for multiple access ports per function unit results in quadratic growth in the size of the operand register file 103 as the number of function units 101 is increased, thus limiting the number of function units 101 that can practicably be supported. Each new port into the register file 103 typically requires an additional set of word lines, bit lines and access transistors to access the storage cells of the selected register, and the storage elements themselves, commonly implemented by inverter-based latch elements, also must usually grow to accommodate the additional output loading.
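The scaling trend may be approximated numerically. The following sketch assumes a first-order model, not stated in this form in the text, in which register-file area scales with the square of the total port count; the function names and the default of two read ports and one write port per function unit are likewise illustrative assumptions.

```python
def port_count(units, reads_per_unit=2, writes_per_unit=1):
    """Total access ports on a shared register file serving `units` function units."""
    return units * (reads_per_unit + writes_per_unit)


def relative_area(units, reads_per_unit=2, writes_per_unit=1):
    """Area relative to a single-function-unit design.

    Assumed first-order model: area grows with the square of the port count.
    """
    ports = port_count(units, reads_per_unit, writes_per_unit)
    baseline = port_count(1, reads_per_unit, writes_per_unit)
    return (ports / baseline) ** 2


for n in (1, 2, 3, 8):
    print(n, port_count(n), relative_area(n))
```

Under these assumptions, tripling the number of function units multiplies the register-file area by nine rather than three.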
To counter the scalability limitations inherent in the shared register file approach, distributed register file architectures, such as the architecture 120 shown in FIG. 2A, have been developed. In a distributed register file architecture, the shared register file is, in effect, decomposed into multiple smaller operand register files 1251-1253, each of which is dedicated to supplying operands 124 to a respective one of function units 1231-1233. A switch circuit 127 (e.g., a cross-bar switch or other switch fabric) is provided to enable operation results 126 from each of the function units 123 to be stored in any of the distributed register files 125.
The scalability achieved in the distributed register file architecture 120 comes at the cost of increased data access overhead for the individual function units 123. More specifically, if an operand needed by a given function unit is stored in a remote register file 125 instead of the local, directly-connected register file (a condition which may arise due to resource conflict in accessing a given register file, necessitating storage of the operation result temporarily in a different register file), an execution cycle typically must be wasted within the remote function unit (i.e., the function unit directly connected to receive operands from the remote register file) to copy the operand via switch 127 to the local register file, an operation referred to herein as a register-to-register copy. Thus, the increased scalability of the distributed register file architecture comes at the cost of a performance penalty, as execution intervals (operation cycles) are occasionally consumed in performing register-to-register copy operations. FIGS. 1B and 2B illustrate this performance tradeoff in terms of instruction execution sequences for the shared register file architecture of FIG. 1A and the distributed register file architecture of FIG. 2A, respectively. As shown, instruction slots iw0_s1-iw4_s1 (i.e., instruction slots directed to function unit 1011) may be executed in sequence (in respective operation cycles as shown) and without interruption in the shared-register-file architecture of FIG. 1A (producing corresponding results rslt0_s1-rslt4_s1 some time later, according to pipeline depth within the function units 101). By contrast, that same instruction slot sequence must be interrupted in the distributed register file architecture 120 to perform a register-to-register copy (“r2r”), the copy operation consuming an operation cycle to produce an operand needed by one of the function units at the output of another (i.e., at the output of the function unit coupled to the register file containing the desired operand) and to write the needed operand back to the appropriate register file.
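The difference between the two execution sequences may be sketched as a simple cycle timeline. The model below is illustrative only: the function `build_timeline` is hypothetical, pipeline latency is ignored, and the choice of which slot finds its operand in a remote register file is an assumption made for the example rather than a detail taken from FIG. 2B.

```python
def build_timeline(slots, local_file):
    """Return the per-cycle sequence for one function unit.

    `slots` is a list of (slot_name, operand_file) pairs, where operand_file
    identifies the register file currently holding the slot's operand.
    A register-to-register copy ("r2r") cycle is inserted ahead of any slot
    whose operand is not in the local register file.
    """
    timeline = []
    for name, operand_file in slots:
        if operand_file != local_file:
            timeline.append("r2r")  # one operation cycle spent on the copy
        timeline.append(name)
    return timeline


# Assumed example: the operand for iw2_s1 resides in remote register file 2.
slots = [("iw0_s1", 1), ("iw1_s1", 1), ("iw2_s1", 2), ("iw3_s1", 1), ("iw4_s1", 1)]

# Shared register file: every operand is directly accessible, so no copies.
shared = [name for name, _ in slots]

# Distributed register files: function unit 1 is served by local file 1.
distributed = build_timeline(slots, local_file=1)

copy_cycles = distributed.count("r2r")
```

With a single remote operand, the five-slot sequence occupies six operation cycles in the distributed model instead of five.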
Referring again to FIG. 2A, the tradeoff between performance and register file size may be extended even further in a fully-distributed register file architecture in which each of the distributed register files 125 is further decomposed into a set of even smaller dual-ported register files 1311-1313. Each dual-ported register file 131 is dedicated to a respective operand input of a particular function unit 123, and additional switching circuitry 133 is provided to access an address-selected one of the register files 131 and an address-selected one of registers 1350 through 135N-1 within the selected register file 131. While the fully-distributed register file architecture tremendously simplifies the register file implementation and saves precious die area that may be re-allocated to implementation of additional function units and associated register files, the register-to-register copy overhead becomes substantial, with such operations consuming up to 20% or more of the execution pipeline in some applications.
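The effect of such overhead on execution time may be estimated with a short calculation. The sketch below assumes, for illustration only, that every copy cycle performs no other useful work; the function name is hypothetical and the 20% figure is simply the value quoted above.

```python
def slowdown_from_copy_overhead(copy_fraction):
    """Execution-time multiplier when `copy_fraction` of all cycles are copies.

    Assumes each register-to-register copy cycle does no other useful work.
    """
    if not 0.0 <= copy_fraction < 1.0:
        raise ValueError("copy_fraction must be in the range [0, 1)")
    return 1.0 / (1.0 - copy_fraction)


useful_fraction = 1.0 - 0.20                   # share of cycles doing useful work
slowdown = slowdown_from_copy_overhead(0.20)   # about 1.25x longer execution
```

Under this assumption, a 20% copy overhead lengthens execution by roughly one quarter relative to a copy-free schedule.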