This invention relates to computers and computer system central processing units, and especially to methods and structures for handling and storing operand values.
Central processing units for server computers, workstations and personal computers today typically employ superscalar microarchitectures that will likely be employed in embedded processors as well. These machines aim to execute more than one instruction in each clock cycle. To increase their execution rates many of the present-day designs execute instructions out-of-order. They search through the predicted stream of upcoming instructions to find those that can be started. In such out-of-order schemes an instruction needing results that will be generated by another, incomplete instruction can be deferred in favor of instructions whose input operands are ready for immediate use.
Contemporary computer designs often include means for renaming operands (especially register operands) so that instructions needing the latest value of an operand being computed by a previously issued instruction can access the immediate output of that previous instruction instead of waiting for the value to be stored (e.g. into a register file) and re-fetched. These prior art computer processors fail to gather much information about operands and the flow of operand values and then discard all or most of the information they do gather.
Continuing development of electronics technology has been allowing designers to incorporate more circuits into each computer processor. With ever more circuits it would be advantageous to be able to issue more instructions in each clock cycle, but several barriers are encountered in attempting straightforward extensions to today's superscalar methods. Managing more in-process instructions requires larger windows of instructions from which to choose, and the complexity of a window grows faster than its size.
Among the most severe barriers to increasing the number of instructions executed per clock cycle is multiporting of operand storage. Operands are stored in register files and in memory units. To make up for the relative slowness of large-scale memory units such as DRAMs, faster but smaller cache memories are connected to central processing units. A typical computer instruction might have two input operands and one output operand. Executing such an instruction would require at least three operand accesses. Executing four instructions in each clock cycle would typically require twelve operand accesses. These operands would typically be spread between register storage ("a register file") and data cache storage. Each simultaneous access to a storage unit will require a read or write port. Unfortunately, the number of components required to construct access ports grows much faster than the number of ports supplied. Doubling the number of ports to the register file might require quadrupling the number of components devoted to access ports. Increasing the number of ports is also likely to increase the access time to the operand storage being ported. "For a register file, doubling the number of ports doubles the number of wordlines and bitlines (quadrupling the register file area in the limit . . ." [Farkas, Keith I., Norman P. Jouppi and Paul Chow, "Register file design considerations in dynamically scheduled processors", p. 18, WRL Research Report 95/10, Digital Western Research Laboratory, Palo Alto].
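The port arithmetic above can be illustrated with a minimal C sketch. The function names are hypothetical, and the quadratic area model is only a simplification of the quoted wordline and bitline observation, not a measured result.

```c
#include <assert.h>

/* Assumption taken from the text: a typical instruction has two input
 * operands and one output operand, so three operand accesses. */
enum { ACCESSES_PER_INSTRUCTION = 3 };

/* Operand accesses needed in one clock cycle at a given issue width. */
unsigned operand_accesses_per_cycle(unsigned issue_width)
{
    return issue_width * ACCESSES_PER_INSTRUCTION;
}

/* Relative storage-cell area under a simple quadratic model: every port adds
 * a wordline and a bitline to each cell, so area scales with ports squared.
 * Illustrative assumption only. */
unsigned long relative_port_area(unsigned ports)
{
    assert(ports > 0);
    return (unsigned long)ports * ports;
}
```

Under this model, four instructions per cycle need twelve operand accesses, and doubling the port count from three to six quadruples the relative area.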
My U.S. Pat. No. 5,974,538 explains how computer instruction operands can be annotated as to their source (value creating) instructions and how operand flows to receiving (target) instructions can then be mapped. This is shown in FIG. 1, where an output 106 of an instruction 101 has been annotated with the address of instruction 101. Subsequent use of operand 106 (R1) by an instruction 103 causes creation of a flow mapping 104 indicating that the output of instruction 101 will flow to instruction 103. Subsequent executions of source instruction 101, whose flow has been mapped, can initiate forwarding of operands to target instructions so that they may make use of them as inputs, and can trigger those receiving instructions to begin execution earlier than would occur in sequential execution of the same program code.
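The annotate-then-map idea of FIG. 1 might be sketched in C roughly as follows. This is an illustrative software model, not the structure of U.S. Pat. No. 5,974,538: the type and function names, the table sizes, and the use of 0 as a "not annotated" marker are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGS  32     /* hypothetical register count */
#define MAX_FLOWS 64     /* hypothetical mapping capacity */
#define NO_SOURCE 0u     /* hypothetical marker: register not yet annotated */

/* One mapped flow: the output of source_pc is an input of target_pc. */
struct flow { unsigned source_pc; unsigned target_pc; };

struct flow_mapper {
    unsigned writer_pc[NUM_REGS];   /* annotation: address of the last writer */
    struct flow flows[MAX_FLOWS];
    size_t num_flows;
};

void mapper_init(struct flow_mapper *m)
{
    for (size_t r = 0; r < NUM_REGS; r++)
        m->writer_pc[r] = NO_SOURCE;
    m->num_flows = 0;
}

/* The instruction at pc writes reg: annotate the register with that address. */
void mapper_note_write(struct flow_mapper *m, unsigned pc, unsigned reg)
{
    assert(reg < NUM_REGS && pc != NO_SOURCE);
    m->writer_pc[reg] = pc;
}

/* The instruction at pc reads reg: map a flow from the annotated source.
 * Returns the source address, or NO_SOURCE if the register has no annotation. */
unsigned mapper_note_read(struct flow_mapper *m, unsigned pc, unsigned reg)
{
    assert(reg < NUM_REGS);
    unsigned src = m->writer_pc[reg];
    if (src == NO_SOURCE)
        return NO_SOURCE;
    for (size_t i = 0; i < m->num_flows; i++)      /* flow already mapped? */
        if (m->flows[i].source_pc == src && m->flows[i].target_pc == pc)
            return src;
    if (m->num_flows < MAX_FLOWS) {                /* otherwise record it */
        m->flows[m->num_flows].source_pc = src;
        m->flows[m->num_flows].target_pc = pc;
        m->num_flows++;
    }
    return src;
}
```

Using the figure's reference numerals as stand-in addresses, a write of R1 by instruction 101 followed by a read of R1 by instruction 103 yields one mapping from 101 to 103; repeating the read adds nothing new.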
FIG. 2A and FIG. 2B show the mapping storage structure given in U.S. Pat. No. 5,974,538. FIG. 2A shows mapping information stored in a linked list data structure while FIG. 2B shows a hashed structure with linked overflow. A design might, in that disclosure, sometimes choose to omit mapping some flows from source instructions to operand target instructions and, where flows have been mapped, a machine might operate in speculative mode where operands are forwarded to target instructions before all intervening branch paths from source instruction to target instruction have been resolved.
U.S. Pat. No. 5,974,538 also discusses the use of a Temporary Result Cache [C29 L40] to decrease traffic to architected storage locations for values that will soon be overwritten with newer values. This cache is, however, concerned with holding outputs of instructions until those values have been superseded, to avoid materializing them. It is not concerned with holding operand values to be forwarded to other instructions as discussed here.
A similar scheme was put forth for memory operands in a paper by Moshovos and Sohi [Moshovos, Andreas and Gurindar Sohi, "Streamlining inter-operation memory communication via data dependence prediction", Proc. Micro-30, December, 1997, IEEE]. In that scheme dependences on memory operands are first detected by annotating memory operands in a small annotations file called a Dependence Detection Table that records the program counter (address) of the last instruction to touch each recorded address. Part of the system of Moshovos and Sohi is depicted in FIG. 3. A store instruction 302 stores a new value in a memory hierarchy 301 at a storage address 303. The value at that storage address is later used as input by a load instruction 304 that loads the value to a register from where it may be used by other subsequent instructions. The passing of a value from a store instruction to a load instruction causes creation of an association between those instructions that is stored in an association record 307 in a dependence prediction and naming table 312. Later execution of the store instruction will create an entry 311 in a synonym file 310. When dependent load instruction 304 is issued it can obtain its value from the synonym file instead of having to wait for a memory operand address calculation and cache or memory access to complete. Moshovos and Sohi also describe a transient value cache that can hold values likely to be killed (overlaid) soon with new values for the same address. The methods of Moshovos and Sohi are intended for use in a speculative superscalar machine in which instructions are speculatively issued before all preceding conditional branch instructions have been resolved. The machine will, at times, execute down the wrong path of instructions and have to abandon some of its results.
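The detection step described above, in which a small table remembers which instruction last touched each address, could be modeled in C along the following lines. This is a hedged sketch rather than the organization used by Moshovos and Sohi: the direct-mapped layout, the table size, the index function, and the convention that a program counter of 0 means "no dependence" are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DDT_ENTRIES 16   /* hypothetical size of the small annotations file */

struct ddt_entry {
    int       valid;
    uintptr_t data_addr;   /* memory address that was stored to */
    unsigned  store_pc;    /* address of the store that last touched it */
};

struct ddt { struct ddt_entry e[DDT_ENTRIES]; };

/* Hypothetical direct-mapped index on word-aligned addresses. */
static unsigned ddt_index(uintptr_t addr)
{
    return (unsigned)((addr >> 2) % DDT_ENTRIES);
}

void ddt_init(struct ddt *t)
{
    for (unsigned i = 0; i < DDT_ENTRIES; i++)
        t->e[i].valid = 0;
}

/* A store executes: remember which instruction last wrote this address. */
void ddt_note_store(struct ddt *t, unsigned store_pc, uintptr_t addr)
{
    assert(store_pc != 0);   /* 0 is reserved here to mean "none" */
    struct ddt_entry *en = &t->e[ddt_index(addr)];
    en->valid = 1;
    en->data_addr = addr;
    en->store_pc = store_pc;
}

/* A load executes: return the store it depends on, or 0 if none is recorded.
 * A nonzero result is the store/load pair to enter in the prediction table. */
unsigned ddt_note_load(const struct ddt *t, uintptr_t addr)
{
    const struct ddt_entry *en = &t->e[ddt_index(addr)];
    return (en->valid && en->data_addr == addr) ? en->store_pc : 0;
}
```

With the figure's numerals as stand-in addresses, a store by instruction 302 followed by a load from the same location reports 302 as the source; a load from an unrecorded location reports no dependence.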
Dependences recorded in the synonym associations of Moshovos and Sohi also include a Predictor that is used to determine whether forwarding should occur between a given pair of instructions. The Predictor described in Moshovos and Sohi is intended only to reflect the likelihood of operand dependence between the instructions. That paper proposed predictors not just of dependencies from store instructions to load instructions but also between load instructions. An instruction that loads a value from a given memory location might be followed by another load instruction that loads the contents of that same memory location. There is then a read-after-read (RAR) dependence between the two load instructions and a system can take advantage of that dependence to bypass the expense of the second load. The Predictor is still concerned with predicting true dependences between instructions as opposed to any other functions.
In a similar paper Tyson and Austin [Tyson, Gary S. and Todd M. Austin, "Improving the accuracy and performance of memory communication through renaming", Proc. Micro-30, December, 1997, IEEE] also describe forwarding of memory values and use of dependency predictors. As depicted in FIG. 4, values 407 are stored in a Value File 406 from which load instructions speculatively retrieve the values they need without having to await calculation of memory operand addresses. A Store/Load Cache 401 provides indexes 403 and 405 into the value file based on the instruction (PC) addresses of a store instruction 402 and a load instruction 404.
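A rough C model of the arrangement in FIG. 4 is given below, assuming an untagged table indexed by instruction address and a round-robin allocator for value file slots. Those choices, the sizes, and all names are hypothetical and are not taken from Tyson and Austin; a real design would also verify each speculatively loaded value against memory.

```c
#include <assert.h>

#define SLC_ENTRIES 64   /* hypothetical Store/Load Cache size */
#define VF_ENTRIES  16   /* hypothetical Value File size */

struct renamer {
    int  vf_index[SLC_ENTRIES];   /* Store/Load Cache: PC -> value file slot, -1 if none */
    long value[VF_ENTRIES];       /* Value File contents */
    int  value_valid[VF_ENTRIES];
    int  next_free;               /* round-robin slot allocator */
};

void renamer_init(struct renamer *r)
{
    for (int i = 0; i < SLC_ENTRIES; i++) r->vf_index[i] = -1;
    for (int i = 0; i < VF_ENTRIES; i++)  r->value_valid[i] = 0;
    r->next_free = 0;
}

/* A store/load dependence was detected: give both instructions the same
 * value file index. Untagged indexing means distinct PCs can alias. */
void renamer_link(struct renamer *r, unsigned store_pc, unsigned load_pc)
{
    int slot = r->next_free;
    r->next_free = (r->next_free + 1) % VF_ENTRIES;
    r->value_valid[slot] = 0;
    r->vf_index[store_pc % SLC_ENTRIES] = slot;
    r->vf_index[load_pc % SLC_ENTRIES]  = slot;
}

/* A store executes: deposit its value in the value file if it is linked. */
void renamer_store(struct renamer *r, unsigned store_pc, long v)
{
    int slot = r->vf_index[store_pc % SLC_ENTRIES];
    if (slot >= 0) {
        r->value[slot] = v;
        r->value_valid[slot] = 1;
    }
}

/* A load issues: speculatively fetch its value without an address
 * calculation. Returns 1 and fills *out on success, 0 otherwise. */
int renamer_load(const struct renamer *r, unsigned load_pc, long *out)
{
    assert(out != 0);
    int slot = r->vf_index[load_pc % SLC_ENTRIES];
    if (slot < 0 || !r->value_valid[slot])
        return 0;
    *out = r->value[slot];
    return 1;
}
```

Once store 402 and load 404 have been linked, a value written by the store is available to the load through the shared slot; an unlinked load simply falls back to the normal memory path.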
The last two methods outlined above attempt to predict whether there are operand dependencies between selected instructions and record confidence information about probable dependence relationships. U.S. Pat. No. 5,974,538 maps actual dependences and may elect not to map some dependences due to distance or uncertainty. None of the above methods provides means to classify operands by use (load vs. store) or by time proximity of next use. With no means of classification there can be no optimization of operand forwarding or storage. All operands are treated as equal. There is only dependence or non-dependence between instructions. Storing operands in only one or two possible centralized storage structures will lead to bottlenecks as computer designers attempt to scale to much higher instruction issue and retirement rates. Scaling instruction rates with centralized storage structures requires increasingly more ports into those storage structures, and such centralized storage cannot be distributed among multiple, loosely connected execution or retirement units.
A paper by Tam et al. [Tam, Edward, et al., "Active management of data caches by exploiting reuse information", IEEE Transactions on Computers 48, No. 11, November, 1999, IEEE] analyses several multilateral caching systems. In these schemes, memory operands are classified into two different classes that are stored in two different (multilateral) caches based on behavior of the operands while they were in cache. Operands are sought in multiple caches in parallel to minimize access time. Parallel access to the multiple caches is required because no record is made of an operand's current cache location(s) or, in the case of the original Victim Cache, a second cache is searched when a required operand is not located in the primary data cache. These methods are not scalable to large numbers of execution units because the cache structures are centralized: large numbers of execution units accessing the same centralized cache stores would require non-scalable degrees of read and write multiporting.
Loop Operations.
Much of the work done by computer programs is done in loops. U.S. Pat. No. 5,974,538 details an elaborate mechanism for keeping track of operands in loops and in nested loops but that mechanism applies only to forwarding and combining, not to operand flow mapping and not to discovering multi-iteration loop dependencies. U.S. Pat. No. 5,974,538 also shows a means for simultaneous execution of multiple loop iterations. U.S. Pat. No. 5,974,538 treats forwarding of operands from one loop iteration to a subsequent loop iteration but shows no means to forward operands where the loop iteration distance is more than one. Memory operands can commonly skip loop iterations as shown in a C language loop
for (i=2; i<100; i++) m[i]=m[i]+m[i-2]*c[i];
where the loop distance is two and m is an array variable in memory.
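To make the loop distance concrete, the following C sketch runs the same loop while annotating each element of m with the iteration that last wrote it, then reports how many iterations separate a value's producer from its consumer. The annotation array and the function name are hypothetical devices for illustration, not a mechanism described in the cited documents.

```c
#include <assert.h>

#define N 100

/* Executes the loop from the text on m and c. Each element of m is annotated
 * with the iteration that last wrote it; a read of m[i-2] that finds an
 * annotation reveals a loop-carried flow. Returns the observed dependence
 * distance in iterations, or -1 if no loop-carried flow was seen. */
int loop_dependence_distance(double m[N], const double c[N])
{
    int writer[N];        /* hypothetical annotation: writing iteration, -1 if none */
    int distance = -1;

    for (int i = 0; i < N; i++)
        writer[i] = -1;

    for (int i = 2; i < N; i++) {
        int src = writer[i - 2];          /* who produced the m[i-2] input? */
        if (src >= 0) {
            int d = i - src;
            /* this loop's access pattern is regular, so d never changes */
            assert(distance < 0 || distance == d);
            distance = d;
        }
        m[i] = m[i] + m[i - 2] * c[i];
        writer[i] = i;                    /* annotate the output with its source */
    }
    return distance;
}
```

Iterations 2 and 3 read values that existed before the loop began; from iteration 4 onward every m[i-2] input was produced two iterations earlier, so the function reports a distance of two.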
Tyson and Austin decline to map forwardings for distances greater than one: "our predictors will not work with loop dependence distances greater than one, even if they are regular accesses. Support for these cases are currently under investigation." Likewise Moshovos and Sohi demur from this task: "since data addresses are calculated dynamically, the lifetimes of the dynamic dependences may overlap (as for example in the following loop that has a recurrence that spans 3 iterations: for i=1 to N do a[i+3]=a[i]+1). In this case, remembering the most recent synonym for the static dependence is not sufficient. Instead, the load has to determine which of all previous synonyms is the appropriate one. Even though support for regular communication patterns can be provided, further investigation of this issue is beyond the scope of this paper."
There are also common cases where one loop fills in or modifies values in a data structure and then those values are used by (flow to) one or more subsequently executed program loops. None of the documents referenced above teach any means to map such inter-loop flows or any way to exploit such flows at execution time.
Instruction level parallelism could be increased and program execution times decreased in highly parallel computer designs by exploiting operand flows that cross more than one intra-loop iteration (loop distance greater than one) or that flow into other, subsequent loops, by forwarding those operands and starting target instructions where the forwarding has made all needed operands available. But such forwarding will require more information about operand use, information that is discarded by today's processor designs.
Multiple Processor Designs.
Modern mainframe, server, and workstation computer systems often include multiple central processors (CPUs) in a Symmetric Multi-Processing (SMP) arrangement where main memory (typically semiconductor-based) is shared but one or more levels of fast cache memory are not shared. It is a problem in SMP computer systems to keep the private caches coherent so that all processors will see a unique memory location as having the same value. This problem is complicated by executing memory access (load or store) instructions speculatively before it is known whether the branch paths leading to those memory access instructions will actually be followed. It could be very inappropriate for a speculative load or store instruction to send cache coherence signals to other processors when a branch leading to that instruction might have been mispredicted so that the instruction's execution will have to be nullified. It would be even more problematic if a synchronization instruction, like Compare And Swap, were speculatively executed and affected execution in one of the other memory coupled processors. The latter problem would lead to incorrect operation. Speculative execution of loads and stores to memory areas that could contain operands shared across CPUs requires facilities missing from current computer designs.
Branch Prediction.
Branch prediction has been improved to the point of being very accurate, but not perfect. Predicting that a given instruction will or will not be executed requires that all intervening conditional branches between the last resolved branch and the given instruction be predicted correctly. Since conditional branches are extremely common, often between 15% and 25% of executed instructions, increasing instruction execution rates to very high levels requires increasing the number of conditional branches whose outcomes are predicted. To issue 16 instructions at each cycle may require correct predictions for four branches per cycle, and the latency of the oldest unresolved branch could be several cycles, so it may be necessary to predict a dozen branches. If each branch is predicted with 95% accuracy and the predictions are independent, then the probability that a dozen consecutive branches are all predicted correctly is 0.95 raised to the twelfth power, which is only a bit above 54%. So nearly half the time instructions issued in the current cycle would need to be abandoned, restarting from the first incorrect branch. A prediction method with longer range than those in current use would allow easier, more efficient increases in instruction execution rates. Information about operand flows between instructions is often a better indicator of longer-range instruction flow than is information about the history of the intervening branch instructions. Today's typical computer processor design retains the branch history whilst discarding the operand flow history even where the operand flows show that the flows of instruction control converge to a common subsequent instruction that could be issued early to increase instruction level parallelism.
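The compound-accuracy estimate above can be checked with a few lines of C. The sketch assumes that branch predictions are independent and equally accurate; the three-cycle resolution latency used to reach a dozen branches is likewise an assumption, since the text says only "several cycles". Function names are hypothetical.

```c
#include <assert.h>

/* Branches that must be predicted while the oldest one is still unresolved. */
unsigned unresolved_branches(unsigned branches_per_cycle, unsigned latency_cycles)
{
    return branches_per_cycle * latency_cycles;
}

/* Probability that every one of `branches` predictions is correct, assuming
 * independent predictions that each succeed with per_branch_accuracy. */
double path_accuracy(double per_branch_accuracy, unsigned branches)
{
    assert(per_branch_accuracy >= 0.0 && per_branch_accuracy <= 1.0);
    double p = 1.0;
    for (unsigned i = 0; i < branches; i++)
        p *= per_branch_accuracy;
    return p;
}
```

Four branches per cycle over an assumed three cycles gives twelve outstanding predictions, and path_accuracy(0.95, 12) evaluates to roughly 0.54.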
Stack Pointer Operations.
Stack data structures are useful to a number of tasks carried out by computer systems. They are used for call-return instruction addresses, for subroutine data passing and for dynamic (function/subroutine automatic) work areas. Many subroutines are quite short, comprising only a few instructions. Completing dozens or even hundreds of instructions for a single thread in each clock cycle will require multiple updates to a single stack pointer. However, today's processor microarchitectures could not accommodate such high levels of instruction parallelism because they must single thread the updates to such a stack pointer. Easing this bottleneck will require retention of operand information that these prior art designs discard.
The foregoing and other objects, features, and advantages of the invention will be apparent from the following detailed description in conjunction with the drawings.
The present invention is a computer system and method for managing operands to improve their flow and storage and to better exploit their behavior during instruction execution to increase instruction execution rates. Operand behaviors are noted in a new class of operand management indicator store means: extended operand management indicators. These indicators are used to select among different classes of operand storage and among different operand forwarding modes. Operands that are read-only or are seldom changed have their values stored in K-Caches that are designed for operands that are never or nearly never updated. Such caches can be replicated without needing frequent coherence operations so that instruction execution can be distributed across computer logic areas or even onto multiple, separate parts (e.g., separate chips). Because they are seldom updated, K-caches also have fewer write ports than other caches. Load and store operations that use constants as address inputs or that repetitively reload unchanged values can be streamlined. Operand values that will be immediately used are forwarded directly to instruction execution units that will need them. Operand values that will not be needed so soon are stored in transient value cache(s) to await execution of the target instructions that will need them.
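As a loose illustration of how such indicators might steer an operand, the C sketch below chooses among the three destinations named above. The indicator fields, the two-cycle "immediate use" threshold, and the priority given to the K-cache test are all hypothetical; the summary itself does not specify an encoding or a selection order.

```c
#include <assert.h>

/* The three destinations named in the summary. */
enum operand_route {
    ROUTE_K_CACHE,          /* read-only or seldom-changed operands */
    ROUTE_FORWARD_DIRECT,   /* value needed immediately by an execution unit */
    ROUTE_TRANSIENT_CACHE   /* value needed later by a target instruction */
};

/* Hypothetical indicator fields; not an encoding taken from the text. */
struct operand_indicators {
    int      seldom_updated;       /* nonzero: never or nearly never written */
    unsigned cycles_to_next_use;   /* predicted delay until a target needs it */
};

#define IMMEDIATE_USE_CYCLES 2u    /* hypothetical threshold for "immediately" */

/* Pick a storage class or forwarding mode from the recorded indicators. */
enum operand_route select_route(const struct operand_indicators *ind)
{
    assert(ind != 0);
    if (ind->seldom_updated)
        return ROUTE_K_CACHE;
    if (ind->cycles_to_next_use <= IMMEDIATE_USE_CYCLES)
        return ROUTE_FORWARD_DIRECT;
    return ROUTE_TRANSIENT_CACHE;
}
```

The point of the sketch is only that classification precedes placement: once behavior has been recorded per operand, the choice of storage or forwarding path becomes a simple lookup rather than a uniform default.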
Memory operand values created in loops are annotated to identify their loop source iterations. Storage of this operand management information enables forwarding across multiple loop iterations (not limited to a distance of one) and possible forwarding to other loops having these operands as inputs.
Extended operand management information is also used to record those operands that are subject of inter-processor (SMP) cache coherence signaling and cross-CPU serialization operations so that memory sharing can be done with less delay.
Other extended operand management indicators are employed in extending the range of branch prediction and to speculatively forward operands to receiving instructions conditional upon prior branch paths.
Extended operand management indicators are also used to streamline push and pop stack-addressing operations.