1. Field of the Invention
The present invention relates to processors. More specifically, the present invention relates to architectures for Very Long Instruction Word (VLIW) processors.
2. Description of the Related Art
One technique for improving the performance of processors is parallel execution of multiple instructions to allow the instruction execution rate to exceed the clock rate. Various types of parallel processors have been developed including Very Long Instruction Word (VLIW) processors that use multiple, independent functional units to execute multiple instructions in parallel. VLIW processors package multiple operations into one very long instruction, the multiple operations being determined by subinstructions that are applied to the independent functional units. An instruction has a set of fields corresponding to each functional unit. Subinstruction lengths typically range from 16 to 64 bits per functional unit, producing an instruction length that is often in a range from 64 to 512 bits for VLIW groups of four to eight subinstructions.
The multiple functional units are kept busy by maintaining a code sequence with sufficient operations to keep instructions scheduled. A VLIW processor often uses a technique called trace scheduling to maintain scheduling efficiency by unrolling loops and scheduling code across basic function blocks. Trace scheduling also improves efficiency by allowing instructions to move across branch points.
Limitations of VLIW processing include limited parallelism, limited hardware resources, and a vast increase in code size. A limited amount of parallelism is available in instruction sequences. Unless loops are unrolled a very large number of times, insufficient operations are available to fill the instruction capacity of the functional units. The operational capacity of a VLIW processor is not determined by the number of functional units alone. The capacity also depends on the depth of the operational pipeline of the functional units. Several functional units, such as the memory, branching controller, and floating point functional units, are pipelined and perform a much larger number of operations than can be executed in parallel. For example, when two operations are issued in a clock cycle to a floating point pipeline with a depth of eight steps, neither operation can depend on any of the operations already within the floating point pipeline. Accordingly, the actual number of independent operations is approximately equal to the average pipeline depth times the number of execution units. Consequently, the number of operations needed to maintain a maximum efficiency of operation for a VLIW processor with four functional units is twelve to sixteen, which corresponds to an average pipeline depth of three to four stages.
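The estimate above can be checked with a short calculation. The following Python sketch is illustrative only: the function name and the average depths of three and four stages are assumptions chosen to reproduce the twelve-to-sixteen range stated in the text, not values given for any particular processor.

```python
def required_independent_ops(average_pipeline_depth: float, num_units: int) -> float:
    """Approximate number of independent operations needed to keep every
    functional unit busy: average pipeline depth times number of units."""
    return average_pipeline_depth * num_units


# Assumed average depths of 3 and 4 stages for a four-unit VLIW processor.
low_estimate = required_independent_ops(3, 4)
high_estimate = required_independent_ops(4, 4)
print(low_estimate, high_estimate)
```

With four functional units, the two assumed depths give the lower and upper ends of the range described above.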
Limited hardware resources are a problem, not only because of duplication of functional units but more importantly due to a large increase in memory and register file bandwidth. A large number of read and write ports are necessary for accessing the register file, imposing a bandwidth that is difficult to support without a large cost in the size of the register file and degradation in clock speed. As the number of ports increases, the complexity of the memory system further increases. To allow multiple memory accesses in parallel, the memory is divided into multiple banks having different addresses. The banking reduces the likelihood that multiple operations in a single instruction have conflicting accesses. Conflicting accesses cause the processor to stall since synchrony must be maintained between the functional units.
Code size is a problem for several reasons. The generation of sufficient operations in a nonbranching code fragment requires substantial unrolling of loops, increasing the code size. Also, instructions that are not full include unused subinstructions that waste code space, increasing code size. Furthermore, the increase in the size of storages such as the register file increases the number of bits in the instruction for addressing registers in the register file.
A challenge in the design of VLIW processors is effective exploitation of instruction-level parallelism. Highly parallel computing applications that have few data dependencies and few branches are executed most efficiently using a wide VLIW processor with a greater number of subinstructions in a VLIW group. However, many computing applications are not highly parallel and include branches or data dependencies that waste space in instruction memory and cause stalling. Referring to FIG. 1, a graph illustrates a comparison of instruction issue efficiency and processor size as VLIW group width is varied. The left axis of the graph relates to an instruction-level parallelism plot 10 that depicts the number of instructions executed per cycle against VLIW issue width. The right axis of the graph relates to a relative processor size plot 12 that shows relative processor size in relation to VLIW issue width.
What are needed are a technique and processor architecture that increase the capacity for instruction-level parallelism while efficiently using resources so that the number of functional units kept busy in each cycle and the number of useful operations in a VLIW group are increased.
A Very Long Instruction Word (VLIW) processor has a clustered architecture including a plurality of independent functional units and a multi-ported register file that is divided into a plurality of separate register file segments, the register file segments being individually associated with the plurality of independent functional units. The functional units access the respective associated register file segments using read operations that are local to the functional unit/register file segment pairs. In contrast, the functional units access the register file segments using write operations that are broadcast to a plurality of register file segments.
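The local-read, broadcast-write behavior can be illustrated with a small software model. The Python sketch below is a simplification under stated assumptions: the class name, the count of four segments, and the 32 registers per segment are hypothetical, and every write is assumed to reach all segments.

```python
class SegmentedRegisterFile:
    """Toy model: one register file segment per functional unit."""

    def __init__(self, num_segments: int = 4, num_registers: int = 32):
        # Assumed sizes; each segment holds a full copy of the register state.
        self.segments = [[0] * num_registers for _ in range(num_segments)]

    def read(self, unit: int, reg: int) -> int:
        # Local read: a functional unit touches only its own segment.
        return self.segments[unit][reg]

    def write(self, reg: int, value: int) -> None:
        # Broadcast write: the result is copied into every segment.
        for segment in self.segments:
            segment[reg] = value


rf = SegmentedRegisterFile()
rf.write(5, 42)  # a result produced by any one functional unit
values_seen = [rf.read(unit, 5) for unit in range(4)]
print(values_seen)
```

In this model each functional unit observes the written value through a read of its own segment only, so no read path crosses between clusters.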
In an illustrative embodiment, independence between clusters is attained since the separate clustered functional unit/register file segment pairs have local (internal) bypassing that allows internal computations to proceed, but have only limited bypassing between different functional unit/register file segment pair clusters. Thus, a particular functional unit/register file segment pair does not bypass to all other functional unit/register file segment pairs.
Using local bypassing rather than global bypassing greatly reduces the interconnection structures within the processor, advantageously reducing the length of interconnect lines, reducing processor size, and increasing processor speed by shortening the distance of signal transfer. Independent clustering of functional units advantageously forms a highly scalable structure in VLIW processor architecture using distributed functional units.
In some embodiments, a clustered functional unit/register file segment pair also includes one or more annexes that stage or delay intermediate results of an instruction, thereby controlling data hazard conditions. The annexes contain storage for storing destination register (rd) specifiers for all annex stages, valid bits for the stages of a pipeline, and priority logic that determines a most recent value of a register in the register file.
The annexes include multiplexers that select matching stages among bypass levels in a priority logic that selects data based on priority matching within a priority level. The annexes include compare logic that compares destination specifiers of an instruction executing within the annex pipeline against source and destination specifiers of other instructions currently executing in the local bypass range of a functional unit/register file segment pair cluster.
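The role of the destination specifiers, valid bits, and priority logic can be sketched in software. The following Python model is only an approximation under stated assumptions: the names AnnexStage and bypass_lookup are hypothetical, stages are assumed to be ordered youngest first, and the youngest valid match is assumed to hold the most recent value. The actual priority scheme of a given embodiment may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnexStage:
    rd: int      # destination register specifier held in this stage
    value: int   # staged intermediate result
    valid: bool  # valid bit for this stage


def bypass_lookup(stages: List[AnnexStage], regfile: List[int], rs: int) -> int:
    """Return the most recent value of source register rs.

    Assumes stages[0] is the youngest stage. The compare step matches rs
    against each valid rd; the first (youngest) match wins. With no match,
    the value comes from the local register file segment.
    """
    for stage in stages:
        if stage.valid and stage.rd == rs:
            return stage.value
    return regfile[rs]


regfile = [0] * 8
regfile[4] = 7
stages = [
    AnnexStage(rd=3, value=30, valid=True),   # youngest write to r3
    AnnexStage(rd=3, value=10, valid=True),   # older write to r3
    AnnexStage(rd=4, value=99, valid=False),  # invalid stage, ignored
]
r3 = bypass_lookup(stages, regfile, 3)
r4 = bypass_lookup(stages, regfile, 4)
print(r3, r4)
```

Here the read of r3 is satisfied from the youngest matching stage, while the read of r4 falls through to the register file because the only matching stage is not valid.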
A multi-ported register file is typically metal limited, so that the area consumed by the circuit is proportional to the square of the number of ports. The multi-ported register file is divided into register file segments that are individually allocated among functional unit/register file segment pair clusters. The plurality of separate and independent register files forms a layout structure with an improved layout efficiency. The read ports of the total register file structure are allocated among the separate and individual register files. The separate and individual register files have write ports that correspond to the total number of write ports in the total register file structure. Writes are fully broadcast so that all of the separate and individual register files are coherent.
In one illustrative embodiment, a 16-port register file structure with twelve read ports and four write ports is split into four separate and individual 7-port register files, each with three read ports and four write ports. The area of a single 16-port register file would have a size proportional to 16 times 16 or 256. Each of the separate and individual register files has a size proportional to 7 times 7 or 49, for a total of 4 times 49 or 196. The capacity of the single 16-port register file and the four 7-port register files is identical, with the split register file structure advantageously having a significantly reduced area. The reduced area advantageously corresponds to an improvement in access time of a register file and thus speed performance due to a reduction in the length of word lines and bit lines connecting the array cells, which reduces the time for a signal to pass on the lines. The improvement in speed performance is highly advantageous due to strict time budgets that are imposed by the specification of high-performance processors and also to attain a large capacity register file that is operational at high speed.
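The area comparison follows directly from the square-of-ports relationship and can be reproduced as below. The Python sketch uses a hypothetical helper name and computes relative area only; the constant of proportionality is not given in the text and is taken as one.

```python
def relative_area(read_ports: int, write_ports: int) -> int:
    """Relative area of a metal-limited register file, taken as
    proportional to the square of the total number of ports."""
    ports = read_ports + write_ports
    return ports * ports


# Single register file: twelve read ports and four write ports.
monolithic = relative_area(12, 4)

# Four segments: three read ports each, all four write ports each.
split = 4 * relative_area(3, 4)

saving = 1 - split / monolithic
print(monolithic, split, saving)
```

Under this relationship the split structure occupies 196 relative units against 256 for the single register file, a reduction of about 23 percent.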