Technical Field
The present invention generally relates to processing of instructions in a microprocessor, and more particularly, to a method and apparatus for the dynamic creation of operations utilizing a wide datapath in a microprocessor.
Description of the Related Art
Modern microprocessor design faces a number of severe constraints, including non-scaling or reverse scaling of signal speeds in signal wires, the exploding power budgets associated with leakage energy, and burgeoning control complexity. The number of instructions a microprocessor processes simultaneously is important not only for its architectural performance, but also for its complexity, achievable operating frequency, and energy consumption.
Specifically, as more instructions are processed, storage structures must be allocated to hold them, which increases area and thereby both the leakage power and the length of the signaling wires needed to transmit information. Additionally, supporting more instructions in flight entails more issue slots, more dependence-checking logic, wider commit logic, and so forth. All of these increase both control complexity and the chip area required to provide the needed controls.
To address these challenges, one promising solution is the use of architectures operating on wide data, wherein a single instruction word can execute on several data words simultaneously and in parallel. An example of a recent architecture exploiting pervasive data parallelism is described in U.S. Pat. No. 6,839,828 and U.S. Patent Application No. 2005/0160097 (SIMD-RISC MICROPROCESSOR). SIMD (single instruction, multiple data) processing is processing wherein a single instruction operates on multiple data words.
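By way of illustration only, the difference between scalar and wide-data execution can be modeled in a few lines of Python. This is a minimal behavioral sketch, not a description of any cited architecture: the lane count `LANES`, the 32-bit word width, and the function names `scalar_add` and `simd_add` are assumptions chosen for the example.

```python
from typing import List

LANES = 4                # assumed number of data words per wide register
WORD_MASK = 0xFFFFFFFF   # assumed 32-bit data words


def scalar_add(a: int, b: int) -> int:
    """One instruction, one data word: a 32-bit wrapping add."""
    return (a + b) & WORD_MASK


def simd_add(va: List[int], vb: List[int]) -> List[int]:
    """One instruction, LANES data words: the same add applied per lane.

    The lanes are independent; no carry propagates from one lane to the next.
    """
    if len(va) != LANES or len(vb) != LANES:
        raise ValueError("wide operands must hold exactly LANES words")
    return [scalar_add(x, y) for x, y in zip(va, vb)]
```

In this model a single call to `simd_add` stands for one instruction that must be fetched, tracked, and committed, whereas the equivalent scalar code needs four `scalar_add` instructions to produce the same four results.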
While the introduction of a new architecture permits software to benefit from new, pervasively data-parallel instructions that operate on multiple data elements in parallel, a new architecture prevents binary compatibility with previously deployed systems. An alternative is to add data-parallel computing elements to an existing microprocessor architecture. New processor implementations can then benefit from the provisioning of instructions operating on wide data, while still permitting execution of legacy binaries using the base scalar instruction set.
Using the extended instruction set offers the advantage of increasing the number of operations which can be performed without enlarging the structures that determine how many instructions can be initiated and completed in a cycle, or the storage structures, such as instruction buffers, issue queues and commit tables, used to track instructions.
While the introduction of instruction set extensions permits the adoption of advanced novel computing techniques such as data-parallel processing, adoption of such new techniques is often practically limited by the need to provide backward compatibility, wherein software developers need to ensure compatibility of an application not only with the most recent version of the architecture, but also with previous architecture generations.
In the prior art, merging of instructions has been performed to reduce the number of memory requests, and to reduce tracking overhead by storing multiple instructions as part of a single instruction group, wherein some tracking information is maintained only on a per-group basis.
Referring now to the merging of instructions in the prior art, one form of merging includes merging multiple store requests using merging store queues. Such merging is based on address values, which are not available until after the fetch, dispatch, issuance and execution of an instruction, negating advantages provided by the present disclosure as will be discussed below. Merging store requests also does not improve the computational performance of computationally bound problems and does not permit the exploitation of data-parallel execution data paths.
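The address dependence of store merging can be illustrated with a small behavioral model. The following Python sketch is an assumption-laden illustration rather than a description of any particular prior-art design: the class name `MergingStoreQueue`, the 16-byte line size `LINE_BYTES`, and the byte-granular bookkeeping are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

LINE_BYTES = 16  # assumed width of one merged memory request


class MergingStoreQueue:
    """Toy model: stores falling in the same aligned line share one entry."""

    def __init__(self) -> None:
        # line base address -> {byte offset within line: byte value}
        self.entries: Dict[int, Dict[int, int]] = {}

    def store(self, addr: int, data: bytes) -> None:
        # Merging is keyed on the computed address, so in hardware it can
        # only happen after the store has executed and produced that address.
        for i, value in enumerate(data):
            byte_addr = addr + i
            base = byte_addr - (byte_addr % LINE_BYTES)
            self.entries.setdefault(base, {})[byte_addr - base] = value

    def drain(self) -> List[Tuple[int, Dict[int, int]]]:
        """Emit one memory request per line and empty the queue."""
        requests = list(self.entries.items())
        self.entries = {}
        return requests
```

In this model, two four-byte stores to adjacent addresses drain as a single memory request, yet each store was still fetched, tracked, and executed as a separate instruction, which is the limitation noted above.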
In accordance with the prior art, cache miss services can be combined. Again, this combining is based on address values computed by separate instructions, and by using a single wide line to satisfy multiple memory access requests, not by executing multiple operations in parallel.
The IBM POWER4™ processor merges multiple Power Architecture™ instructions into an instruction group for efficient tracking in tables such as a global completion table (GCT). (See Joel M. Tendler, J. S. Dodson, J. S. Fields, Jr., H. Le, B. Sinharoy, “POWER4 System Microarchitecture,” IBM Journal of Research and Development, Vol. 46, No. 1, pp. 5-26, January 2002). The instructions of a group are nevertheless issued and executed independently, each needing separate space in issue queues and so forth.
A technique similar to POWER4™ group formation is used under the name micro-ops fusion to fuse micro-ops into macro-ops for tracking, as described in “The Intel Pentium M Processor: Microarchitecture and Performance”, Intel Technology Journal, Volume 07, Issue 02, May 2003. Specifically, with micro-ops fusion, the instruction decoder fuses two micro-ops into one micro-op and keeps them united throughout most parts of the out-of-order core of the processor, namely at allocation, dispatch, and retirement. To retain the benefits of non-fused behavior, the micro-ops are executed as non-fused operations at the execution level. This provides an effectively wider instruction decoder, allocation, and retirement. As in the prior art POWER4™ microarchitecture, ops are fused for the purpose of tracking (including renaming, dispatch and retirement), but not for the purpose of execution. The Intel Technology Journal article cited above shows that the execution units work in the un-fused domain.
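The distinction between fusing for tracking and fusing for execution can be sketched as follows. This Python model is purely illustrative: the names `TrackedEntry`, `fuse_pairs`, and `execution_slots`, as well as the caller-supplied `fusible` predicate, are hypothetical and do not reproduce the decoder rules of any actual processor.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TrackedEntry:
    """One tracking slot (allocation, dispatch, retirement) in the model."""
    micro_ops: Tuple[str, ...]  # one micro-op, or two fused micro-ops


def fuse_pairs(micro_ops: List[str],
               fusible: Callable[[str, str], bool]) -> List[TrackedEntry]:
    """Greedily fuse adjacent micro-ops that the predicate allows."""
    entries: List[TrackedEntry] = []
    i = 0
    while i < len(micro_ops):
        if i + 1 < len(micro_ops) and fusible(micro_ops[i], micro_ops[i + 1]):
            entries.append(TrackedEntry((micro_ops[i], micro_ops[i + 1])))
            i += 2
        else:
            entries.append(TrackedEntry((micro_ops[i],)))
            i += 1
    return entries


def execution_slots(entries: List[TrackedEntry]) -> int:
    """Execution stays un-fused: every micro-op still needs its own slot."""
    return sum(len(entry.micro_ops) for entry in entries)
```

For a stream of four micro-ops in which one adjacent pair is fusible, `fuse_pairs` yields three tracked entries while `execution_slots` still reports four, reflecting that fusion of this kind reduces tracking cost but not execution work.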
Pajuelo, Gonzalez, and Valero describe speculative dynamic vectorization in “Speculative Dynamic Vectorization”, Proceedings of the 29th Annual International Symposium on Computer Architecture, Anchorage, AK, 2002. This technique depends on the detection of strided loop behavior, negating the performance benefits of short SIMD sequences, and requires the provision of a full vector unit, a vector register file, and a validation engine for speculatively vectorized vector operations. This technique also does not target the creation of instructions operating on wide data (including, but not limited to, SIMD parallel execution), but rather traditional vector operations with their inherent strided access.
Because speculative dynamic vectorization is driven by strided loads, it is located in the back end of a microprocessor pipeline, and does not reduce the number of operations which must go through the front end of the machine for fetching and validation. Thus, while this offers significant performance improvements for strided vector operations, it does not address the front-end bottleneck in a satisfactory manner.
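To illustrate why a scheme driven by strided loads cannot help short sequences, a generic stride detector can be modeled as below. This Python sketch is not the mechanism of the cited paper; it is a common textbook-style stride table, and the class name `StrideDetector`, the per-load-address table, and the confidence `threshold` of 2 are assumptions made for the example.

```python
from typing import Dict, Tuple


class StrideDetector:
    """Toy per-load stride table: (last address, last stride, confidence)."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold  # assumed number of confirmations needed
        self.table: Dict[int, Tuple[int, int, int]] = {}

    def observe(self, pc: int, addr: int) -> bool:
        """Record one executed load; return True once a stride is trusted."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0, 0)
            return False
        last_addr, stride, confidence = entry
        new_stride = addr - last_addr
        if new_stride == stride and new_stride != 0:
            confidence += 1   # same non-zero stride seen again
        else:
            confidence = 0    # stride changed, start over
        self.table[pc] = (addr, new_stride, confidence)
        return confidence >= self.threshold
```

With the assumed threshold, the detector in this model reports a stable stride only on the fourth execution of the same load, so a sequence of only a few operations completes before vectorization could begin, and every one of those loads has already passed through the front end as an ordinary instruction.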