1. Field of the Invention
This invention relates to employing an instruction reorder buffer, and particularly to a technique that takes at least two processors that are optimized to execute dependence chains and co-locates the processors with a superstructure called SuperROB (Super Re-Order Buffer).
2. Description of Background
Many processors designed today are optimized for execution of tight dependence chains. A dependence chain is a sequence of instructions in a program in which each later instruction is data-dependent on an earlier instruction. Examples of key data dependence paths whose latencies processors optimize are: load-compare-branch, load-load, load-compute, and compute-compute. Examples of such processors are: the PPE (Power Processing Element) core on the Sony-Toshiba-IBM Broadband Engine, the IBM Power3 core, Itanium cores from Intel®, and almost all of the modern cores implementing z/Architecture technologies.
Current research in processor technology and computer architecture is motivated primarily by the desire for greater performance. Greater performance may be achieved by increasing parallelism in execution. There are two kinds of parallelism in typical program workloads. These are Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP). Some modern computer processors are specifically designed to capture ILP in programs (for example, IBM Power4 & 5, Intel Pentium), while multiprocessor systems are designed to capture TLP across threads or processes. Processor cores that are optimized to execute dependence chains are often also expected to execute ILP workloads. ILP workloads have more than one concurrent dependence chain, and overlapped execution of the chains is typically possible, provided the ILP between the chains has been exposed and exploited by the machine.
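The distinction between a single tight dependence chain and an ILP workload with several concurrent chains can be illustrated with a minimal sketch. The instruction format (a destination register plus a list of source registers), the register names, and the helper names `chain_depths` and `ideal_ilp` are assumptions made purely for illustration and are not part of the described processors. Only true (read-after-write) data dependences are tracked, and every instruction is assumed to take one cycle.

```python
# Hypothetical toy format: each instruction is (destination, [source registers]).

def chain_depths(program):
    """For each instruction, return the length of the longest
    dependence chain that ends at that instruction."""
    last_writer = {}  # register -> index of the latest instruction writing it
    depths = []
    for idx, (dest, sources) in enumerate(program):
        # Earlier instructions that produce one of this instruction's inputs.
        producers = [last_writer[s] for s in sources if s in last_writer]
        depths.append(1 + max((depths[p] for p in producers), default=0))
        last_writer[dest] = idx
    return depths


def ideal_ilp(program):
    """Instructions per cycle on an idealized machine limited only by
    data dependences: instruction count / longest chain length."""
    if not program:
        return 0.0
    return len(program) / max(chain_depths(program))


# One tight chain, in the style of load-compare-branch.
serial = [("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r2"])]

# Two independent chains interleaved in program order.
parallel = [("r1", ["r0"]), ("r5", ["r4"]), ("r2", ["r1"]), ("r6", ["r5"])]

print(chain_depths(serial), ideal_ilp(serial))
print(chain_depths(parallel), ideal_ilp(parallel))
```

In this sketch the serial program offers no overlap, since every instruction waits on its predecessor, while the interleaved program contains two chains that an idealized machine could execute side by side.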
The evolution of microprocessor design has led to processors with higher clock frequencies to improve single-thread performance. These processors exploit ILP to speed up single-threaded applications. ILP extraction attempts to increase performance by determining, at run time, which instructions can be executed in parallel. The trade-off is that ILP extraction requires highly complex microprocessors that consume a significant amount of power.
Thus, it is well known that different processor technologies exploit ILP and TLP workloads differently to achieve greater processor performance. However, in existing ILP and TLP system architectures it is difficult to optimize the processor for both high-throughput TLP-oriented and ILP-oriented applications, and it is very cumbersome to map ILP applications onto one or more TLP cores. Accordingly, alternative processor architectures are necessary for providing ILP extraction on demand, for allowing global communication, for allowing efficient ILP exposure, extraction, and exploitation, and for efficiently operating across a plurality of TLP cores.