1. Field of the Invention
The present invention relates to parallel processing. More specifically, the present invention relates to the improved decoding of a configurable register file for faster initiation stages of parallel processing.
2. The Background
Parallel processing involves the execution of multiple processes simultaneously. Numerous types of parallel processing schemes have been utilized, but a common one is the Very Long Instruction Word (VLIW) scheme. VLIW processors use multiple, independent functional units to execute instructions in parallel. Generally, the multiple operations are combined into a single very long instruction. The multiple operations are determined by sub-instructions that are applied to the independent functional units.
A VLIW processor usually uses a technique known as trace scheduling to maintain a code sequence with sufficient operations to keep instructions scheduled by unrolling loops and scheduling code across basic function blocks. Trace scheduling may also improve efficiency by allowing instructions to move across branch points.
FIG. 1 is a schematic diagram illustrating a parallel processor. The processor 50 contains multiple media processing units 52, 54. Each media processing unit 52, 54 includes an instruction cache 56, an instruction aligner 58, an instruction buffer 60, a pipeline control unit 62, a split register file 64, a plurality of execution units 66, 68, 70, 72, and a load/store unit 74. The media processing units 52, 54 may use a plurality of execution units for executing instructions. The execution units 66, 68, 70, 72 may include three media functional units (MFU) 66, 68, 70 and one general functional unit (GFU) 72. The MFUs 66, 68, 70 may be multiple single-instruction-multiple-datapath (MSIMD) media functional units. Each of the MFUs 66, 68, 70 may be capable of processing 16-bit components. Various parallel 16-bit operations supply the single-instruction-multiple-datapath capability, including add, multiply-add, shift, compare, and others. The MFUs 66, 68, 70 operate in combination as tightly-coupled digital signal processors (DSPs).
Each MFU 66, 68, 70 may have a separate and individual sub-instruction stream, but all the MFUs 66, 68, 70 execute synchronously so that the sub-instructions lock-step through the pipeline stages.
The GFU 72 may be a processor capable of executing arithmetic logic unit (ALU) operations, reciprocal square root operations, and others. The GFU 72 also may support less common parallel operations such as the parallel reciprocal square root instruction.
The instruction cache 56 may have a 16 Kbyte capacity and include hardware support to maintain coherence, allowing dynamic optimizations through self-modifying code. Software may be used to indicate that the instruction storage is being altered when modifications are made. Coherency may be maintained by hardware that supports write-through, non-allocating caching.
The pipeline control unit 62 may be connected between the instruction buffer 60 and the functional units 66, 68, 70, 72. The pipeline control unit 62 schedules the transfer of instructions to the functional units 66, 68, 70, 72. The pipeline control unit 62 also receives status signals from the functional units 66, 68, 70, 72 and the load/store unit 74 and uses the status signals to perform several control functions. The pipeline control unit 62 maintains a scoreboard and generates stalls and bypass controls. The pipeline control unit 62 also may generate traps and maintain special registers.
Each media processing unit 52, 54 includes a split register file 64, which is a single logical register file. The split register file 64 is split into a plurality of register file segments 76, 78, 80, 82 to form a multi-ported structure that is replicated to reduce the integrated circuit die area and to reduce access time. A separate register file segment 76, 78, 80, 82 is allocated to each of the media functional units 66, 68, 70 and the general functional unit 72. In the illustrative embodiment, each register file segment 76, 78, 80, 82 has 128 32-bit registers. The first 96 registers (0-95) in each register file segment 76, 78, 80, 82 are global registers. All the functional units 66, 68, 70, 72 may write to the 96 global registers. The global registers are coherent across all functional units (MFUs and GFU) 66, 68, 70, 72 so that any write operation to a global register by any functional unit is broadcast to all register file segments 76, 78, 80, 82. Registers 96-127 in the register file segments 76, 78, 80, 82 are local registers. Local registers allocated to a functional unit 66, 68, 70, 72 are not accessible or "visible" to other functional units 66, 68, 70, 72.
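The global and local register behavior described above can be sketched in software. The following is a minimal behavioral model only, not a description of the circuit; the class name `SplitRegisterFile`, its methods, and the indexing of functional units as 0-3 are assumptions introduced for illustration.

```python
NUM_SEGMENTS = 4        # one segment per functional unit (three MFUs, one GFU)
NUM_REGISTERS = 128     # registers per segment
NUM_GLOBAL = 96         # registers 0-95 are global; 96-127 are local
WORD_MASK = 0xFFFFFFFF  # each register is 32 bits wide


class SplitRegisterFile:
    """Hypothetical model: four segments that behave as one logical file."""

    def __init__(self):
        self.segments = [[0] * NUM_REGISTERS for _ in range(NUM_SEGMENTS)]

    def write(self, unit, reg, value):
        value &= WORD_MASK
        if reg < NUM_GLOBAL:
            # Global write: broadcast so every segment stays coherent.
            for segment in self.segments:
                segment[reg] = value
        else:
            # Local write: only the writing unit's own segment changes.
            self.segments[unit][reg] = value

    def read(self, unit, reg):
        # Each functional unit reads only from its own segment.
        return self.segments[unit][reg]
```

In this sketch, a write to register 10 by unit 0 becomes readable by all four units, whereas a write to register 100 by unit 1 is readable only by unit 1.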
The media processing units 52, 54 are highly structured computation blocks that execute software-scheduled data computation operations with fixed, deterministic and relatively short instruction latencies, operational characteristics yielding simplification in both function and cycle time. The operational characteristics support multiple instruction issue through a very long instruction word (VLIW) approach that avoids hardware interlocks to account for software that does not schedule operations properly. Such hardware interlocks are typically complex, error-prone, and create multiple critical paths. A VLIW instruction word includes one instruction that executes in the general functional unit (GFU) 72 and from zero to three instructions that execute in the media functional units (MFU) 66, 68, 70. An MFU instruction field within the VLIW instruction word may include an operation code (opcode) field, three source register (or immediate) fields, and one destination register field.
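The composition of a VLIW instruction word described above can be illustrated with a small data structure. This is a sketch under stated assumptions: the names `SubInstruction` and `VLIWWord` are hypothetical, the opcode is shown as a string rather than an encoded bit field, and the GFU instruction is assumed to share the MFU field layout, which the text does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

MAX_MFU_SLOTS = 3  # zero to three MFU instructions per VLIW word


@dataclass
class SubInstruction:
    opcode: str                    # operation code field
    sources: Tuple[int, int, int]  # three source register (or immediate) fields
    destination: int               # one destination register field


@dataclass
class VLIWWord:
    gfu: SubInstruction            # exactly one GFU instruction (layout assumed)
    mfu: List[SubInstruction] = field(default_factory=list)

    def __post_init__(self):
        if len(self.mfu) > MAX_MFU_SLOTS:
            raise ValueError("a VLIW word holds at most three MFU instructions")

    def width(self):
        # Number of sub-instructions issued together: 1 to 4.
        return 1 + len(self.mfu)
```

A word carrying only the GFU instruction has a width of one; a fully populated word has a width of four.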
Speed and ease of access are often problems encountered when dealing with register files. In order to solve these problems, register files are often split. FIG. 2 is a schematic block diagram illustrating a split register file 64. The split register file 64 supplies all operands of processor instructions that execute in the media functional units 66, 68, 70 and the general functional unit 72 and receives results of the instruction execution from the execution units. The split register file 64 is the source and destination of store and load operations, respectively.
Large, multiple-ported register files are typically metal-limited so that the register area is proportional to the square of the number of ports. A sixteen-port file is roughly proportional in size and speed to a value of 256. The split register file 64 is divided into four register file segments 100, 102, 104, and 106, each having three read ports and four write ports, so that each register file segment has a size and speed proportional to 49, for a total area for the four segments that is proportional to 196. The split register file is therefore potentially smaller and faster than a single central register file. Write operations are fully broadcast so that all files are maintained coherent. Logically, the split register file 64 is no different from a single central register file. From the perspective of layout efficiency, however, the split register file 64 is smaller and has better performance.
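The area comparison above follows directly from the square-of-ports rule. The short calculation below reproduces the figures given in the text; the function name `relative_cost` is a hypothetical helper, and the result is a unitless proportionality value, not a physical area.

```python
def relative_cost(ports):
    # Metal-limited register file: cost grows with the square of the port count.
    return ports ** 2


central = relative_cost(16)          # single sixteen-port file: 256
per_segment = relative_cost(3 + 4)   # three read ports + four write ports: 49
split_total = 4 * per_segment        # four segments: 196

print(central, per_segment, split_total)  # prints "256 49 196"
```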
Splitting the register file into multiple segments in the split register file 64, in combination with the character of data accesses in which multiple bytes are transferred to the plurality of execution units concurrently, results in a high utilization rate of the data supplied to the integrated circuit chip and effectively leads to a much higher data bandwidth than is supported on normal processors.
Normal applications often fail to exploit the large register file 64 because compilers do not effectively use the large number of registers in the split register file 64. However, aggressive in-lining techniques that have traditionally been restricted due to the limited number of registers in conventional systems may be used in the processor 50 to exploit the large number of registers in the split register file 64. In a software system that exploits the large number of registers in the processor 50, the complete set of registers is saved upon the event of a thread (context) switch. When only a few registers of the entire set of registers are used, saving all registers in the full thread switch is wasteful. Waste is avoided in the processor 50 by supporting individual marking of registers. Octants of the thirty-two registers can be marked as "dirty" if used, and are consequently saved conditionally.
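The conditional save described above can be sketched as follows. The text does not state the exact grouping, so this example assumes one possible reading: thirty-two registers divided into eight groups ("octants") of four registers each, with one dirty bit per group. The class name `DirtyTracker` and its methods are hypothetical.

```python
REGISTER_COUNT = 32
GROUP_COUNT = 8                              # "octants" (assumed reading)
GROUP_SIZE = REGISTER_COUNT // GROUP_COUNT   # four registers per group


class DirtyTracker:
    """Hypothetical model of per-octant dirty marking."""

    def __init__(self):
        self.regs = [0] * REGISTER_COUNT
        self.dirty = 0  # bitmask with one bit per octant

    def write(self, reg, value):
        self.regs[reg] = value
        # Mark the octant containing this register as dirty.
        self.dirty |= 1 << (reg // GROUP_SIZE)

    def save_context(self):
        # Save only the octants that were written since the last save.
        saved = {}
        for group in range(GROUP_COUNT):
            if self.dirty & (1 << group):
                start = group * GROUP_SIZE
                saved[group] = self.regs[start:start + GROUP_SIZE]
        self.dirty = 0
        return saved
```

Under this assumption, a thread that writes only register 5 causes a single group of four registers to be saved on a context switch rather than all thirty-two.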
These multiport register files can have large delays when accessed for read or write operations. When a register address is specified, it must be decoded by decoding circuitry. What is needed is a way to simplify the decoding circuitry such that register cells may be accessed with less delay while still incorporating all the required functionality without any additional logic.