Certain VLSI processor architectures now group execution units into clusters to process bundled instructions. One “bundle” contains three instructions; a cluster operates to process one or more bundles of instructions. FIG. 1 illustrates the prior art by showing a processor register file 10 coupled with clusters 12(1), 12(2) . . . 12(N). Each cluster 12 is a physical logic unit that includes multiple pipelines, with bypassing (e.g., as shown in FIG. 2), to process the multiple instructions within the bundles in parallel. The advantages of clusters 12 lie primarily in timing efficiencies. Each cluster 12 processes one bundle more quickly than it processes two bundles; appropriate use of clusters may reduce bypassing requirements within processor architectures. However, cluster-based architectures also lose performance when information is shared between clusters, because moving data between clusters incurs a latency of at least one cycle.
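The inter-cluster latency described above may be illustrated with a minimal timing sketch. The sketch is not taken from the prior art of FIG. 1; the names Instr and ready_cycles, the single-cycle execute time, and the fixed one-cycle penalty (the minimum stated above) are assumptions made only for illustration.

```python
from dataclasses import dataclass

BUNDLE_SIZE = 3            # instructions per bundle, as stated in the text
CROSS_CLUSTER_PENALTY = 1  # assumed: the minimum one-cycle latency between clusters


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record used only for this sketch."""
    name: str
    cluster: int          # index of the cluster that executes the instruction
    sources: tuple = ()   # names of earlier instructions whose results are consumed


def ready_cycles(instrs):
    """Return {name: cycle at which the result is available}.

    Assumes instructions are listed in program order (producers first)
    and that every instruction executes in a single cycle.
    """
    done = {}     # result-available cycle per instruction
    placed = {}   # cluster assignment per instruction
    for ins in instrs:
        start = 0
        for src in ins.sources:
            # Bypassing inside a cluster is modeled as free; a result that
            # crosses clusters arrives CROSS_CLUSTER_PENALTY cycles later.
            penalty = 0 if placed[src] == ins.cluster else CROSS_CLUSTER_PENALTY
            start = max(start, done[src] + penalty)
        done[ins.name] = start + 1
        placed[ins.name] = ins.cluster
    return done


same = ready_cycles([Instr("a", 0), Instr("b", 0, ("a",))])
cross = ready_cycles([Instr("a", 0), Instr("b", 1, ("a",))])
print(same["b"], cross["b"])
```

Under these assumptions, the dependent instruction completes one cycle later when its producer executes in a different cluster, which is the performance loss noted above.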
Certain VLSI processor architectures also use “multi-threading” techniques to process instructions through pipeline stages. FIG. 2 shows one exemplary multi-threading architecture 20 of the prior art. Architecture 20 illustratively has two program counters 22(1), 22(2), an instruction fetch unit 24, a multiplexer 26, a plurality of pipelines 28(1), 28(2) . . . 28(N), bypass logic 30, and register file 10. Multiple program counters 22 provide for the multiple program “threads” through pipelines 28; when any one instruction stalls, another instruction may proceed through pipelines 28 to increase collective instruction throughput. As known in the art, each counter 22 is a register that is written with the address of the next instruction at the end of each instruction fetch cycle in the pipeline; each pipeline 28 includes multiple execution stages such as the fetch stage F, the decode stage D, the execute stage E, and the write-back stage W. Individual stages of pipelines 28 may transfer speculative data to other execution units through bypass logic 30 and multiplexer 26 to reduce data hazards by providing data forwarding capability for architecture 20. Register file 10 is typically written to, or “loaded,” at the write-back stage W on logic lines 32.
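The throughput benefit of multiple program counters may be illustrated with a minimal sketch. The sketch does not model architecture 20 in detail; the function name simulate, the round-robin selection policy, the one-instruction-per-cycle issue rate, and the representation of each instruction as a stall count are assumptions made only for illustration.

```python
def simulate(programs, max_cycles=64):
    """Issue at most one instruction per cycle from any ready thread.

    programs: one list per thread; each entry is the number of cycles the
    thread stalls after issuing that instruction (a hypothetical encoding).
    Returns a trace of (cycle, thread, program_counter) tuples.
    """
    n = len(programs)
    pcs = [0] * n        # one program counter per thread
    ready_at = [0] * n   # first cycle at which each thread may issue again
    last = n - 1         # last thread served; thread 0 is tried first
    trace = []
    for cycle in range(max_cycles):
        if all(pcs[t] >= len(programs[t]) for t in range(n)):
            break  # every thread has finished
        # Round-robin search, starting after the thread served last.
        for step in range(1, n + 1):
            t = (last + step) % n
            if pcs[t] < len(programs[t]) and ready_at[t] <= cycle:
                stall = programs[t][pcs[t]]
                trace.append((cycle, t, pcs[t]))
                pcs[t] += 1                      # advance this thread's counter
                ready_at[t] = cycle + 1 + stall  # thread is blocked while stalled
                last = t
                break
    return trace


# Thread 0 stalls for two cycles after its first instruction;
# thread 1 never stalls and fills the otherwise idle cycles.
two_threads = simulate([[2, 0], [0, 0, 0]])
one_thread = simulate([[2, 0]])
print(two_threads)
print(one_thread)
```

Under these assumptions, the single thread leaves cycles 1 and 2 idle while it is stalled, whereas the two-thread case issues an instruction in every cycle, so collective throughput increases although neither thread runs faster on its own.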
The invention advances the state of the art in processing architectures incorporating logic such as shown in FIG. 1 and FIG. 2 by providing methods and systems for processing multi-thread instructions through clustered execution units. Several other features of the invention are apparent within the description that follows.