1. Technical Field
The present invention generally relates to the processing of instructions in a microprocessor, and more particularly, to implementing an extended register set for one or more classes of instructions in a microprocessor.
2. Description of the Related Art
In modern microprocessors, increases in latencies have been an increasingly severe problem. These increases are occurring both for operations performed on the chip, and for memory access latencies. There are a number of reasons for this phenomenon.
One reason is the trend to achieve performance increases by using higher clock frequencies. This leads to deeper pipelining (i.e., the division of a basic operation into multiple stages) and, hence, a larger number of total stages, as an operation is divided into ever smaller units of work to achieve these high frequencies.
Yet another reason relates to the differences in chip and memory speeds. That is, while chip speeds have been increasing, memory speed has been increasing at a much smaller rate. Thus, in terms of processor cycles to access a memory location in memory, latency has increased significantly. The relatively faster increase in chip speed is due to both the above-mentioned deeper pipelining, and to CMOS scaling used as a technique to increase chip speeds, as disclosed by R. H. Dennard et al., in “Design of Ion-Implanted MOSFETs with Very Small Physical Dimensions,” IEEE Journal of Solid-State Circuits, SC-9, pp. 256-68, 1974, which is incorporated by reference herein.
Moreover, another reason relates to differences in wire and logic speeds. That is, as CMOS scaling is applied ever more aggressively, wire speeds do not scale at the same rate as logic speeds, leading to a variety of latency increases, e.g., by increasing the time required to write back an operation's results.
In addition to aggressive technology scaling and deep pipelining, computer architects have also turned to the use of more aggressive parallel execution by means of superscalar instruction issue, whereby multiple operations can be initiated in a single cycle. Recent microprocessors such as the state-of-the art Power 5 or PowerPC 970 processor can dispatch 5 operations per cycle and initiate operations at the rate of 7 and 9 operations per cycle, respectively.
To continue improving the performance of microprocessors, two challenges are of significance: namely achieving high levels of parallelism and tolerating increasing latency (in terms of processor cycles) of memory. Both achieving higher parallelism and tolerating longer latency require that programs are compiled so as to simultaneously use more independent strands of computation. This, in turn, requires a large number of registers to be available to support the multiple independent strands of computation by storing all of their intermediate results.
A result of the ability to execute more instructions in pipelines with increasing latency, and to initiate execution in multiple pipelines, requires ever-larger amounts of data to be maintained by a processor, to serve as inputs or to be received as results of operations. To accomplish this, architects and programmers have two options: retrieve and store data in a memory hierarchy; or in on-chip register file storage.
Of these choices, register file storage offers multiple advantages, such as higher bandwidth and shorter latency, as well as lower energy dissipated per access. However, the number of registers specified in architectures has not increased since the introduction of RISC computing (when the size of register files was increased from a customary 8 or 16 registers to 32 registers) until recently. Thus, as the demands for faster register storage to buffer input operands and operation results from an increasing number of instructions simultaneously being executed is growing, the number of architected registers has stayed constant, while the performance of memory hierarchies has de facto decreased in terms of processor cycles to provide data to the processor core.
To show how the effectiveness of register files has diminished, in light of changes to processor architecture that have occurred in response to technology shifts, consider the following simple ratios. About 15 years ago (circa 1990), a processor would typically have one floating point pipeline, with about 3 computational pipeline stages, plus typically an additional cycle for register file access. When processing Fused Multiply and Add (FMA) operations, i.e., merged floating point multiply-add high performance computation primitives, a four stage pipeline would have 4 FMA operations simultaneously in flight, each requiring 3 input registers and one output register, for a total of 16 registers to support all these computations in flight, leaving an additional 16 registers to hold other data and/or constants. Considering the parallelism provided by state-of-the-art microprocessors (e.g., the PowerPC 970 provides two floating-point pipelines) coupled with the latencies incurred by deep pipelining, a number of registers well in excess of the 32 registers provided by the PowerPC architecture are required to exploit the peak execution rate provided by a modern microprocessor.
Similarly, in that historic timeframe, a second level cache could be accessed with a 3 (processor) cycle hit latency, giving a ratio of about 10 registers per cycle of L2 cache access latency. This is a conservative measure; to express the actual amount of data required to be maintained in the register files in order to decouple memory access from computational, one would need to determine the number of operands consumed during such time, which scales up with issue width. Still, today, with a 10 to 12 cycle latency to L2, one could expect to see a requirement for 100 to 120 registers.
Large numbers of registers are in fact built, e.g., both the Power4 and Power5 microprocessors have well in excess of 32 registers. However, to exploit such larger register files, complex and area intensive renaming logic and out-of-order issue capabilities are required. Even then, the inability to express the best schedule in the program using a compiler or a skillfully tuned Basic Linear Algebra Subprogram (BLAS) or other such library limits the overall performance potential.
Some current microprocessors implement a technique called register renaming, whereby the limited number of architected registers is translated to use more registers internally. However, while this allows for an increase of the number of registers, register renaming is complex and incurs additional steps in the instruction processing of microprocessors. Thus, what is required to address the challenges in modern microprocessor design is an increased number of registers which are easy to access using an extended name space in the architecture, as opposed to techniques such as register renaming used in high-end microprocessors such as the IBM PowerPC 970 and Power5.
Recently, the IA-64 architecture and the CELL SPU architectures have offered implementations with 128 registers. In reference to these implementations, the IA-64 offers an implementation using instruction bundles, a technique to build instruction words wider than a machine word. While this resolves the issue of instruction encoding space, it leads to inefficient encoding due to a reduction of code density because an instruction word disadvantageously occupies more than a single machine word, thereby reducing the number of instructions which can be stored in a given memory unit.
Recent advances in the encoding instruction sets disclosed in the U.S. patent application to Altman et al., entitled “Method and Apparatus to Extend the Number of instruction Bits in Processors with Fixed Length Instructions in a Manner Compatible with Existing Code”, U.S. patent application Ser. No. 10/720,585, filed on Nov. 24, 2003, which is commonly assigned and incorporated by reference herein, advantageously allow wide instruction words to be used in conjunction with fixed size word instruction set architectures having an instruction format requiring only a single machine word for most instructions. While this offers a significant advantage over prior wide-word bundle-oriented instruction sets in terms of code density, decoding complexity is increased.
In an advantageous implementation of fixed width 32 bit instruction words, the CELL SPU instruction set architecture supports the specification of 128 registers in a 32 bit instruction word, implementing a SIMD-ISA in accordance with the U.S. patent application to Gschwind et al., entitled “SIMD-RISC Microprocessor Architecture”, U.S. patent application Ser. No. 11/065,707, filed on Feb. 24, 2005, and U.S. Pat. No. 6,839,828 to Gschwind et al., entitled “SIMD Datapath Coupled to Scalar/vector/Address/Conditional Data Register File With Selective Subpath Scalar Processing Mode”, which are commonly assigned and incorporated by reference herein.
While the SPU advantageously offers the use of 128 registers in a fixed instruction word using a new encoding that, in turn, uses fields of 7 adjacent bits in a newly specified instruction set, legacy architectures are not without deficiency. For example, since many bit combinations have been assigned a meaning in legacy architectures, and certain bit fields have been aside to signify specific architectural information (such as extended opcodes, register fields, and so forth) legacy architectures offer significant obstacles to encoding new information. Specifically, when allocating new instructions, the specification for these new instructions cannot arbitrarily allocate new fields without complicating the decoding of both the pre-existing and these new instructions.
Additionally, the number of bits in instruction sets with fixed instruction word width limits the number of different instructions that can be encoded. For example, most RISC architectures use fixed length instruction sets with 32 bits. This encoding limitation is causing increasing problems as instruction sets are extended. For example, there is a need to add new instructions to efficiently execute modern applications. Primary examples are multimedia extensions such as Intells MMX and SSE2 and the PowerPC VMX extensions. Moreover, the number of cycles required to access cache and memory is growing as processor frequencies increase. One way to alleviate this problem is to add more registers to the processor to reduce the number of loads. However, it is difficult or impossible to specify additional registers in the standard 32-bit RISC instruction encoding.
The most common solution to this problem is an approach typically associated with CISC architectures, which allows multiple instruction lengths, not a fixed size such as 32 bits. This variable length CISC approach has several problems, and was one of the reasons RISC was developed in the 1980s. Among the problems with variable length CISC encoding is that it complicates instruction decode, adding pipeline stages to the machine or reducing frequency. Moreover, another problem with variable length CISC encoding is that it allows instructions to span cache line and page boundaries, complicating instruction fetch, as well as virtual address translation. Further, another problem with variable length CISC encoding is that such a CISC approach cannot be compatibly retrofitted to a RISC architecture. Most specifically, architectures having fixed length instructions today assume pervasively that all instructions are aligned on the boundary, that branch addresses are specified at a multiple of a fixed length instruction, and so forth. Further, no mechanisms are defined how to address the issue of page-spanning instructions, and so forth.
A second solution to the problem would be to switch to widening all instructions to a wider format, preferably a multiple of the original instruction set. For typical 32 bit RISC instruction sets, the next multiple is 64-bit instructions. However, if all instructions are 64-bits, approximately twice as much space as is currently used would be required to hold instructions. In addition, this would not be compatible with existing RISC code with 32-bit instructions. If 32-bit and 64-bit instructions are intermixed, the instruction set becomes CISC-like with variable width instructions, and with the associated problems just described.
Another solution to the encoding problem is employed by the IA-64 architecture from INTEL and HEWLETT PACKARD. The IA-64 packs 3 instructions in 16 bytes, for an average of 42.67 bits per instruction. This style of encoding avoids problems with page and cache line crossings. However, it “wastes” bits specifying the interaction between instructions, for example “stop bits” are used to indicate if all three instructions can be executed in parallel or whether they are to be executed sequentially or some combination of the two. The 3 instruction packing also forces additional complexity in the implementation to deal with three instructions at once. Finally, this 3 instruction packing format has no requirement to be compatible with existing 32-bit instruction sets, and there is no obvious mechanism to achieve compatibility with 32-bit RISC encodings.
All instruction bundles in this encoding are located at multiples of the bundle size.
A number of approaches have been disclosed to address this increasingly severe problem.
U.S. Pat. No. 6,157,996 to Christie et al., entitled “Processor Programably Configurable to Execute Enhanced Variable Byte Length Instructions Including Predicated Execution, Three operand Addressing, and Increased Register Space”, which is incorporated by reference herein, teaches the use of a prefix byte to extend instruction semantics to include at least one of predicate information, extended register specification, and a third register operand. This implementation is undesirable for fixed instruction width RISC processors, as extension bytes cannot readily be accommodated in the instruction stream of a fixed width instruction set architecture.
U.S. Pat. No. 6,014,739 to Christie, entitled “Increasing General Registers in X86 Processors”, which is incorporated by reference herein, discloses that an extra byte is extended in a variable instruction set to provide additional encoding bits. This implementation is undesirable for fixed instruction width RISC processors, as extension bytes cannot readily be accommodated in the instruction stream of a fixed width instruction set architecture.
U.S. Pat. No. 5,822,778 to Dutton et al., entitled “Microprocessor and Method of Using a Segment Override Prefix Instruction Field to Expand the Register File”, which is incorporated by reference herein, discloses a microprocessor with expanded functionality within an existing variable length instruction set architecture. The control unit detects the presence of segment override prefixes in instruction code sequences executed in flat memory mode and uses prefix values to select a bank of registers. Those skilled in this and related arts will understand that the cost of decoding a prefix, determining the mode and the bank field, accompanied by fetching the instruction being modified by the prefix, incurs a significant complexity, delay and hardware inefficiency. In particular, the decoding of the prefix and bank selector has to be performed early, leading to additional complexity. In addition, prefixes cannot be readily employed in an architecture supporting only a fixed instruction word width.
Another non-transparent use of segment register override prefix bytes may be embodied within an instruction decode/execution unit. Decode/execution unit reads instructions, and operates on operands in a registers specified in the instruction. In this implementation, it is described that segment register override prefix bytes are used by a control unit to select one of multiple register banks which store the operands to be operated on by the decode/execution unit. Each register bank includes the full complement of x86 registers. In this manner, the register set of the architecture may be expanded without changing the instruction encodings. As will be appreciated by those skilled in this and related arts, a larger register set allows more operand values to be held in registers (which may be accessed quickly) and, thus, accesses to memory (which typically require a longer period of time) are lessened. In one implementation, no segment register override prefix byte is used to specify the first bank of registers, a segment register override prefix byte indicating the FS segment register specifies a second bank of registers, a segment register override prefix byte indicating the GS segment register specifies a third bank of registers, and a segment register override prefix byte indicating the ES segment register specifies a fourth bank of registers. In another implementation, the value stored within the selected segment register is used to select the appropriate register bank from numerous register banks.
In accordance with the preceding description relating to the other non-transparent use of segment register override prefix bytes embodied within an instruction decode/execution unit, all operands for a given instruction have to be retrieved from a common bank selected by the prefix selector, specified within the prefix selector in an alternate implementation. Using the segment selector as a bank selector for all operands of a given instruction is undesirable because it requires access to a control register to identify a bank, and restricts all instructions to have operands coming from just a single bank, leading to inefficient register allocation. Thus, if a common value has to be combined with other operands residing in multiple banks, copies of the common value have to be maintained, computed and updated in all banks, such that they can be combined with the other operands residing in the other banks, leading to inefficient register usage due to data duplication, and inefficient performance profile due to the duplication of work to compute the common value in all banks. It is to be appreciated that the preceding implementation has to be programmed liker a clustered machine, with distinct register files represented by the different banks.
U.S. Pat. No. 5,822,778 to Christie et al., entitled “Microprocessor and Method of Using a Segment Override Prefix Instruction Field to Expand the Register File”, which is incorporated by reference herein, discloses that the prefix and the bank select are decoded first, before the instruction is actually retrieved. Then the instruction word is combined, and an access performed. In comparison, the wide select can start the access early, and decode additional information in parallel with the access cycle.
U.S. Pat. No. 5,768,574, to Christie et al., entitled “Microprocessor Using an Instruction Field to Expand the Condition Flags and a Computer System Employing the Microprocessor”, which is incorporated by reference herein, discloses a microprocessor that is configured to detect the presence of segment override prefixes in instruction code sequences being executed in flat memory mode, and to use the prefix value or the value stored in the associated segment register to selectively enable condition flag modification for instructions. An instruction which modifies the condition flags and a branch instruction intended to branch based on the condition flags set by the instruction may be separated by numerous instructions which do not modify the condition flags. When the branch instruction is decoded, the condition flags it depends on may already be available. In another implementation of the referenced invention, the segment register override bytes are used to select between multiple sets of condition flags. Multiple conditions may be retained by the microprocessor for later examination. The conditions that a program utilizes multiple times in a program may be maintained while other conditions may be generated and utilized.
U.S. Pat. No. 5,838,984 to Nguyen et al., entitled “Single-Instruction-Multiple-Data Processing Using Multiple Banks of Vector Registers”, which is incorporated by reference herein, discloses a digital signal parallel vector processor for multimedia applications. As disclosed therein, a single instruction multiple data processor uses several banks of vector registers. This processor uses a bank bit included in a control register to identify a primary bank, and a secondary alternate bank to be identified by a select set of instructions. This is undesirable because it requires access to a control register to identify a bank, and restricts all operations to have operands coming from just a single bank, leading to inefficient register allocation. Thus, if a common value has to be combined with other operands residing in multiple banks, copies of the common value have to be maintained, computed and updated in all banks, such that they can be combined with the other operands residing in the other banks, leading to inefficient register usage due to data duplication, and inefficient performance profile due to the duplication of work to compute the common value in all banks. It is to be appreciated that the preceding implementation has to be programmed like a clustered machine, with distinct register files represented by the different banks.
U.S. Pat. No. 5,926,646 (hereinafter the “'646 patent”) to Pickett et al., entitled “Context-Dependent Memory-Mapped Registers for Transparent Expansion of a Register File”, which is incorporated by reference herein, discloses a context dependent memory mapped register accessing device for transparent expansion of a register file in a microprocessor in a computer system. Therein, in-core registers are made available as a memory-mapped address space. While the adding of additional registers in the core to be referenced by the processor is allowed, the use of memory mapping has several disadvantages. Specifically, the disadvantages relate to the fact that register names can only be properly resolved after the address generation phase, as a multitude of memory address forms can refer to a memory mapped register. This will increase the latency of access to these registers to almost the latency for first level cache access. In addition, a memory-mapped register can only be referenced for those instructions that have operand forms allowing memory accesses. This typically represents only a subset of operations, and often only a subset of operands therein. This limitation is particularly severe for RISC processors, which can only reference memory operands in load and store operations, imposing the additional cost of performing copies from the memory-mapped in-core registers to computationally useable operand registers.
In other disadvantageous aspects of the '646 patent, when addresses are generated before address generation from a subset of “preferred forms”, address aliasing can occur and lead to incorrect program execution. In another disadvantageous aspect of the '646 patent, when an address to such in-core register is added to a linked list, and accessed by a remote processor, this will lead to data coherence inconsistencies. Alternatively, costly methods for accessing such registers from symmetric multiprocessing (SMP) remote nodes have to be implemented and provided.
U.S. Pat. No. 6,154,832 to Maupin, entitled “Processor Employing Multiple Register Sets to Eliminate Interrupts”, which is incorporated by reference herein, discloses a processor which assigns a specified register set for default task and other sets for different interrupt source. While this extends the number of registers implemented in the processor, such an approach is not suitable for the extension of the register set useable by a single process or program.
U.S. Pat. No. 5,737,625 (hereinafter referred to as the “'625 patent”) to Jaggar, entitled “Selectable Processing Registers and Method”, which is incorporated by reference herein, discloses a high performance memory register selection apparatus which has a controller responding to a selection-word to control a circuit to select registers depending on the control field of a word and the prior register selection. This is limited in that only the architected set of prior art registers can be accessed at any one time, thereby not making more than the number of prior art registers available at any one time.
In another disadvantageous aspect of the '625 patent, additional instructions are required in the instruction stream to update the control word. In typical implementations, these updates will have to be made context synchronizing, i.e., no operations before the update may have outstanding references, nor can any instruction occurring in the instruction stream be dispatched until the control register update has completed. In one non-synchronizing aspect of an implementation, multiple rename versions of the control register have to be maintained, disadvantageously leading to design complexity, and high area and power usage.
U.S. Pat. No. 5,386,563 to Thomas, entitled “Register Substitution During Exception Processing”, which is incorporated by reference herein, discloses a data processing system operable in either main or exception processing mode. In accordance with the invention, the CPU restores data stored in a saved processing status register, to another register upon leaving exception-processing mode. While this extends the number of registers implemented in the processor, this is not suitable for the extension of the register set useable by a single process or program.
Microcode used for implementing microprocessor ISAs using internal layering has used a variety of formats, using contiguous or non-contiguous fields. None of these were concerned with the maintenance of cross-generational compatibility or programming orthogonality. In general, microcode has different requirements, and methods from microcode are recognized to not be applicable to architected instruction sets by those skilled in this and related arts due to issues related to the internal representation, requirements for compatibility, decoding of instructions and detection of data and structural hazards (which are not supported in the restricted microcode programming model), as well as the need of maintaining compatible across generations of a design.
Prior art instruction sets have offered the use of non-contiguous immediate constants, e.g., as disclosed by Moreno et al., in “An innovative low-power high-performance programmable signal processor for digital communications”, IBM Journal of Research and Development, Vol. 47, No. 2/3, 2003, which is incorporated by reference herein, to allow extended immediate specifications in bundle encodings, but do not address the encoding of non-contiguous fields in a fixed width instruction. The issues for immediate operand and similar fields are different because they do not require any early steering and access to determine dependences, access of register files, and so forth. In particular, this also has not required advanced decoding and/or register file access implementations. Thus, while constants have been encoded in non-contiguous ways in bundle instruction sets, the encoding of non-contiguous register file specifiers in fixed width instruction sets have eluded the inventors of this and other instruction sets.