1. Technical Field
The present invention relates generally to computers and, more particularly, to methods for generating code for an architecture encoding an extended register specification.
2. Description of the Related Art
In modern microprocessors, increases in latencies have been an increasingly severe problem. These increases are occurring both for operations performed on the chip, and for memory access latencies. There are a number of reasons for this phenomenon.
One reason is that the trend to achieve ever higher performance results in increased use of high clock frequencies. This leads to deeper pipelining (i.e., the division of a basic operation into multiple stages) and, hence, a larger number of total stages as an operation is divided into ever-smaller units of work to achieve these high frequencies.
Another reason relates to the differences in chip and memory speeds. That is, while both chip and memory speeds have been increasing, memory speed has been increasing at a much smaller rate. Thus, in terms of processor cycles to access a location in memory, latency has increased significantly. The relatively faster increase in chip speed is due to both the above-mentioned deep pipelining, and to CMOS scaling used as a technique to increase chip speeds, as disclosed by R. H. Dennard et al., in “Design of Ion-Implanted MOSFETs with Very Small Physical Dimensions,” IEEE Journal of Solid-State Circuits, SC-9, pp. 256-68, 1974, which is incorporated by reference herein.
Moreover, another reason for the increasing latencies relates to differences in wire and logic speeds. That is, as CMOS scaling is applied ever more aggressively, wire speeds do not scale at the same rate as logic speeds, leading to a variety of latency increases, e.g., increasing the time required to complete operations by requiring longer time to write back their results.
In addition to aggressive technology scaling and deep pipelining, computer architects have also turned to the use of more aggressive parallel execution by means of superscalar instruction issue, whereby multiple operations can be initiated in a single cycle. Recent microprocessors such as the state-of-the-art Power5 or PowerPC 970 processor can dispatch 5 operations per cycle and initiate operations at the rate of 7 and 9 operations per cycle, respectively.
To continue improving the performance of microprocessors, there are two challenges of significance: achieving high levels of parallelism; and tolerating increasing latency (in terms of processor cycles) of memory. Both achieving higher parallelism and tolerating longer latency require programs to be compiled to use more independent strands of computation simultaneously. This, in turn, requires a large number of registers to be available to support the multiple independent strands of computation by storing all of their intermediate results.
The ability to execute more instructions in pipelines with increasing latency, and to initiate execution in multiple pipelines, requires ever-larger amounts of data to be maintained by a processor in order to serve as inputs or to be received as results of operations. To accomplish this, architects and programmers have two options, namely to retrieve and store data in a memory hierarchy, or in on-chip register file storage.
Of these choices, register file storage offers multiple advantages, including higher bandwidth and shorter latency, as well as lower energy dissipation per access. However, the number of registers specified in architectures has not increased since the introduction of RISC computing (when the size of register files was increased from the then customary 8 or 16 registers to 32 registers) until recently. Thus, as the demands for faster register storage grow, to buffer input operands and operation results from an increasing number of instructions simultaneously being executed, the number of architected registers has stayed constant while the performance of memory hierarchies has de facto decreased (in terms of processor cycles required to provide data to the processor core).
To show how the effectiveness of register files has diminished in light of changes to processor architecture that have occurred in response to technology shifts, consider the following simple ratios. Approximately 15 years ago, circa 1990, a high-end processor would typically have one floating point pipeline, with about 3 computational pipeline stages, plus an additional cycle for register file access. When processing FMA operations, i.e., merged floating-point multiply-add high performance computation primitives, a pipeline would have 4 FMA operations in flight (one per pipeline stage), each requiring 3 input registers and one output register, for a total of 16 registers to support all computations in flight. Given the typical complement of 32 floating-point registers, this would leave an additional 16 registers to hold other data and/or constants. Considering the parallelism provided by state-of-the-art microprocessors, coupled with the latencies incurred by deep pipelining, the number of registers required to supply and store sufficient operands to exploit the peak execution rate provided by a modern microprocessor is well in excess of the 32 floating-point registers typically provided in the instruction set architecture.
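The register budget described above can be reproduced with a short calculation. The following is a minimal sketch; the function name and the "modern" configuration (two FMA pipelines of seven stages each) are illustrative assumptions and are not taken from any particular processor.

```python
def registers_in_flight(pipelines, stages, inputs_per_op=3, outputs_per_op=1):
    """Registers needed to hold the operands and results of all operations in flight.

    Assumes one operation occupies each pipeline stage, as in the text's example.
    """
    ops_in_flight = pipelines * stages
    return ops_in_flight * (inputs_per_op + outputs_per_op)


ARCHITECTED_FP_REGS = 32

# Circa-1990 example from the text: one pipeline, 3 computational stages
# plus 1 register file access cycle, i.e., 4 FMA operations in flight.
legacy = registers_in_flight(pipelines=1, stages=4)
print(legacy)                        # 16 registers tied up by in-flight FMAs
print(ARCHITECTED_FP_REGS - legacy)  # 16 registers left for other data

# Hypothetical deeper and wider configuration (assumed values, for illustration only).
modern = registers_in_flight(pipelines=2, stages=7)
print(modern > ARCHITECTED_FP_REGS)  # True: demand exceeds the architected 32
```

Under these assumed values the in-flight operations alone would need 56 registers, which illustrates how deeper pipelines and wider issue outgrow a 32-entry architected register file.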
Similarly, in past machines, a second level cache could be accessed with a 3 cycle hit latency, which gives a ratio of about 10 registers available per cycle of latency to L2 cache (i.e., 32 registers divided across the 3 access cycles). This is a conservative measure; the actual number of registers required to completely cover an L2 cache access (and therefore to decouple memory access from computational use) in a more realistic scenario would depend on the actual number of operands consumed during such time, which scales up with issue width. Today, with a 10-12 cycle latency to L2 cache, to preserve a similar 10 registers per cycle ratio would require between 100 and 120 registers.
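The registers-per-cycle ratio above can be checked in the same way. This is a minimal sketch using only the figures quoted in the text; the function names are illustrative.

```python
def regs_per_latency_cycle(num_regs, l2_latency_cycles):
    """Registers available per cycle of L2 hit latency."""
    return num_regs / l2_latency_cycles


def regs_needed(ratio, l2_latency_cycles):
    """Registers required to keep a given registers-per-cycle ratio."""
    return ratio * l2_latency_cycles


# 32 architected registers over a 3 cycle L2 hit latency: "about 10" per cycle.
print(round(regs_per_latency_cycle(32, 3), 2))

# Preserving a ratio of 10 at today's 10-12 cycle L2 latency.
for latency in (10, 12):
    print(latency, regs_needed(10, latency))
```

With a ratio of 10, the two latencies yield 100 and 120 registers respectively, matching the range stated above.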
Large numbers of registers are in fact built, e.g., both the Power4 and Power5 microprocessors implement many more than the 32 architected registers. Several recent microprocessors implement a technique called register renaming, whereby the limited number of architected registers is translated to use a larger pool of (physical) registers internally. However, to exploit these larger register files, complex (and area intensive) renaming logic and out-of-order issue capabilities are required. Even then, the inability to express the best schedule for the program using a compiler or a skillfully tuned Basic Linear Algebra Subprogram (BLAS) or other such library limits the overall performance potential.
While register renaming does allow an increase in the number of registers, register renaming is a complex task that requires additional steps in the instruction processing of microprocessors. Thus, what is required to address the challenges in modern microprocessor design is an increased number of registers that are easy to access using an extended name space in the architecture, as opposed to techniques such as register renaming used in high-end microprocessors such as the IBM PowerPC 970 and Power5.
Recently, the IA-64 architecture and the CELL SPU architecture have offered implementations with 128 architected registers. In reference to these implementations, the IA-64 offers an implementation using instruction bundles, a technique to build instruction words wider than a machine word. While this resolves the issue of instruction encoding space, it leads to inefficient encoding due to a reduction of code density because an instruction word disadvantageously occupies more than a single machine word, thereby reducing the number of instructions which can be stored in a given memory unit.
Recent advances in the encoding of instruction sets, disclosed in the U.S. patent application to Altman et al., entitled “Method and Apparatus to Extend the Number of Instruction Bits in Processors with Fixed Length Instructions in a Manner Compatible with Existing Code”, U.S. patent application Ser. No. 10/720,585, filed on Nov. 24, 2003, which is commonly assigned and incorporated by reference herein, advantageously allow wide instruction words to be used in conjunction with fixed size word instruction set architectures having an instruction format requiring only a single machine word for most instructions. While this offers a significant advantage over prior wide-word bundle-oriented instruction sets in terms of code density, decoding complexity is increased.
In an advantageous implementation of fixed width 32 bit instruction words, the CELL SPU instruction set architecture supports the specification of 128 registers in a 32 bit instruction word, implementing a SIMD-ISA in accordance with the U.S. patent application to Gschwind et al., entitled “SIMD-RISC Microprocessor Architecture”, U.S. patent application Ser. No. 11/065,707, filed on Feb. 24, 2005, and the U.S. Pat. No. 6,839,828 to Gschwind et al., entitled “SIMD Datapath Coupled to Scalar/Vector/Address/Conditional Data Register File With Selective Subpath Scalar Processing Mode”, which are commonly assigned and incorporated by reference herein.
While the SPU advantageously offers the use of 128 registers in a fixed instruction word using a new encoding that, in turn, uses fields of 7 adjacent bits in a newly specified instruction set, this is accomplished in an entirely new instruction set architecture without regard to existing (legacy) instructions or programs. Legacy architectures, on the other hand, are not without deficiency. For example, since many bit combinations have been assigned a meaning in legacy architectures, and certain bit fields have been set aside to signify specific architectural information (such as extended opcodes, register fields, and so forth) legacy architectures offer significant obstacles to encoding new information. Specifically, when allocating new instructions, the specification for these new instructions cannot arbitrarily allocate new fields without complicating the decoding of both the pre-existing and these new instructions.
Additionally, the number of bits in instruction sets with fixed instruction word width limits the number of different instructions that can be encoded. For example, most RISC architectures use fixed length instruction sets with 32 bit instruction words. This encoding limitation is causing increasing problems as instruction sets are extended. For example, there is a need to add new instructions to efficiently execute modern applications. Primary examples are multimedia extensions such as INTEL's MMX, SSE and SSE2 and the PowerPC VMX/Altivec extensions. Moreover, the number of cycles required to access caches and memory is growing as the processor frequencies increase. One way to alleviate this problem is to add more registers to the processor to reduce the number of loads and stores (typically required to spill and restore register values when insufficient register resources or names are available). However, it is difficult or impossible to specify additional registers in the standard 32-bit RISC instruction encoding.
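The encoding pressure described above can be made concrete by counting the bits that register specifiers consume in a 32-bit instruction word. The following is a minimal sketch; the function names are illustrative, and the three- and four-operand formats are assumed examples rather than the layout of any specific instruction set.

```python
def field_bits(num_registers):
    """Bits needed for one register specifier (e.g., 5 bits for 32 registers)."""
    return (num_registers - 1).bit_length()


def bits_left_for_opcode(num_registers, operands, word_bits=32):
    """Bits remaining for opcodes and other fields after the register specifiers."""
    return word_bits - operands * field_bits(num_registers)


for regs in (32, 128):
    for operands in (3, 4):  # e.g., add (3 operands) and multiply-add (4 operands)
        print(regs, operands, field_bits(regs), bits_left_for_opcode(regs, operands))
```

Moving from 32 to 128 registers widens each specifier from 5 to 7 bits, so a four-operand instruction would leave only 4 of the 32 bits for everything else. This is consistent with the observation that a legacy 32-bit encoding has little room for additional register names.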
The most common solution to this problem is an approach typically associated with CISC architectures, which allows multiple instruction lengths, rather than maintaining a single, fixed instruction size such as 32 bits. This variable length CISC approach has several drawbacks, and was one of the reasons RISC was developed in the 1980's. Among the problems associated with variable length CISC encoding is the additional complexity it requires in the instruction decode, resulting in additional decode pipeline stages in the machine or a reduced frequency. Moreover, another problem with variable length CISC encoding is that it allows instructions to span natural memory boundaries (e.g., cache line and page boundaries), complicating instruction fetch and virtual address translation. Another problem with variable length CISC encoding is that such a CISC approach cannot be compatibly retrofitted to a RISC architecture. For example, architectures having fixed length instructions today assume pervasively that all instructions are aligned on the boundary, that branch addresses are specified at a multiple of a fixed length instruction, and so forth. Further, no mechanisms are defined to address the issue of page-spanning instructions, and so forth.
A second solution to the problem would be to widen all instructions to a wider format, preferably a multiple of the original instruction set. For typical 32 bit RISC instruction sets, the next multiple is 64-bit instructions. However, if all instructions are 64-bits, approximately twice as much memory space as is currently used would be required to hold instructions (which would disadvantageously affect elements like an instruction cache). In addition, this is incompatible with existing RISC code with 32-bit instructions. If 32-bit and 64-bit instructions are intermixed, then the instruction set becomes CISC-like with variable width instructions, and with the associated problems described above.
Another solution to the encoding problem is employed by the IA-64 architecture from INTEL and HEWLETT PACKARD. The IA-64 packs three instructions in 16 bytes, for an average of 42.67 bits per instruction. All instruction bundles in this IA-64 encoding are located at multiples of the bundle size. This provides a simplification of some aspects, e.g., an implementation can avoid the issues associated with bundles crossing natural memory boundaries, but does not address the other significant drawbacks.
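The bundle arithmetic and the alignment property can be illustrated as follows. This is a minimal sketch; the split of a bundle into a 5-bit template and three 41-bit slots reflects the published IA-64 format rather than anything stated in this document, and the helper names are illustrative.

```python
BUNDLE_BYTES = 16
SLOTS_PER_BUNDLE = 3
TEMPLATE_BITS = 5   # IA-64 bundle template field (not stated in the text above)
SLOT_BITS = 41      # IA-64 instruction slot width (not stated in the text above)


def bits_per_instruction():
    """Average encoding cost per instruction when three share a 16-byte bundle."""
    return BUNDLE_BYTES * 8 / SLOTS_PER_BUNDLE


def crosses_boundary(address, boundary):
    """True if a bundle starting at `address` straddles a multiple of `boundary`."""
    return address // boundary != (address + BUNDLE_BYTES - 1) // boundary


print(round(bits_per_instruction(), 2))              # 42.67
print(TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS)  # 128 bits in one bundle
print(round(bits_per_instruction() / 32, 2))         # 1.33x the cost of a 32-bit word

# A bundle placed at a multiple of the bundle size never straddles a 4 KB page...
print(crosses_boundary(4096 - BUNDLE_BYTES, 4096))   # False
# ...whereas an unaligned 16-byte unit could.
print(crosses_boundary(4090, 4096))                  # True
```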
This style of instruction bundles encoding avoids problems with page and cache line crossings. However, it “wastes” bits specifying the interaction between instructions. For example, “stop bits” are used to indicate if all three instructions can be executed in parallel or whether they have to be executed sequentially or some combination of the two. The three instruction packing also forces additional complexity in the implementation to deal with three instructions at once. Finally, this three instruction packing format has no requirement to be compatible with existing 32-bit instruction sets, and there is no obvious mechanism to achieve compatibility with (legacy) 32-bit RISC encodings.
A number of approaches have been disclosed to address this increasingly severe problem of insufficient space to encode extended register names within an existing instruction set while maintaining compatibility with legacy programs, tools, and so forth.
U.S. Pat. No. 6,157,996 to Christie et al., entitled “Processor Programably Configurable to Execute Enhanced Variable Byte Length Instructions Including Predicated Execution, Three operand Addressing, and Increased Register Space”, which is incorporated by reference herein, teaches the use of a prefix byte to extend instruction semantics to include at least one of predicate information, extended register specification, and a third register operand. This embodiment is undesirable for fixed instruction width RISC processors, as extension bytes are generally incapable of being accommodated in the instruction stream of a fixed width instruction set architecture.
U.S. Pat. No. 6,014,739 to Christie, entitled “Increasing General Registers in X86 Processors”, which is incorporated by reference herein, discloses adding an extra byte to instructions in a variable length instruction set to provide additional encoding bits. This embodiment is undesirable for fixed instruction width RISC processors, as extension bytes cannot readily be accommodated in the instruction stream of a fixed width instruction set architecture.
U.S. Pat. No. 5,822,778 to Dutton et al., entitled “Microprocessor and Method of Using a Segment Override Prefix Instruction Field to Expand the Register File”, which is incorporated by reference herein, discloses a microprocessor with expanded functionality within an existing variable length instruction set architecture. A control unit detects the presence of segment override prefixes in instruction code sequences executed in flat memory mode and uses prefix values to select a bank of registers. Those skilled in the art will understand that the cost of decoding a prefix, determining the mode and the bank field, accompanied by fetching the instruction being modified by the prefix, incurs a significant complexity, delay and hardware inefficiency. In particular, the decoding of the prefix and bank selector has to be performed early, leading to additional complexity. In addition, prefixes are generally not readily employed in an architecture supporting only a fixed instruction word width.
Another non-transparent use of segment register override prefix bytes may be embodied within an instruction decode/execution unit. A decode/execution unit reads instructions, and operates on operands in a register (or registers) specified in the instruction. In this embodiment, segment register override prefix bytes are used by a control unit to select one of multiple register banks which store the operands to be operated on by the decode/execution unit. Each register bank includes the full complement of x86 registers. In this manner, the register set of the architecture may be expanded without changing the instruction encodings. As will be appreciated by those skilled in the art, a larger register set allows more operand values to be held in registers (which may be accessed quickly) and, thus, accesses to memory (which typically require a longer period of time) are lessened. In one embodiment, no segment register override prefix byte specifies the first bank of registers, a segment register override prefix byte indicating the FS segment register specifies a second bank of registers, a segment register override prefix byte indicating the GS segment register specifies a third bank of registers, and a segment register override prefix byte indicating the ES segment register specifies a fourth bank of registers. In another embodiment, the value stored within the selected segment register is used to select the appropriate register bank from numerous register banks.
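The bank selection in the first embodiment above can be sketched as a small lookup. This is a hedged illustration only: the prefix byte values are the standard x86 segment override encodings, while the bank numbering, the assumption of eight general registers per bank, the restriction to a single leading prefix byte, and all function names are assumptions made for this example.

```python
# Standard x86 segment override prefix bytes mapped to register banks,
# following the embodiment described above (bank numbering is assumed).
PREFIX_TO_BANK = {
    0x64: 1,  # FS override -> second bank
    0x65: 2,  # GS override -> third bank
    0x26: 3,  # ES override -> fourth bank
}
REGS_PER_BANK = 8  # assumed: the eight 32-bit x86 general registers


def select_bank(instruction_bytes):
    """Return the bank selected by a leading segment override prefix, else bank 0."""
    if instruction_bytes and instruction_bytes[0] in PREFIX_TO_BANK:
        return PREFIX_TO_BANK[instruction_bytes[0]]
    return 0


def physical_register(bank, reg):
    """Map (bank, architected register number) to an index in the expanded file."""
    return bank * REGS_PER_BANK + reg


add_eax_ebx = bytes([0x01, 0xD8])      # ADD EAX, EBX with no prefix
with_fs = bytes([0x64]) + add_eax_ebx  # same instruction with an FS override

print(select_bank(add_eax_ebx))                     # 0
print(select_bank(with_fs))                         # 1
print(physical_register(select_bank(with_fs), 0))   # EAX in bank 1 -> 8
```

Note that a single prefix selects one bank for every operand of the instruction, so the two source operands and the destination cannot come from different banks in this sketch.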
In accordance with the preceding description relating to the other non-transparent use of segment register override prefix bytes embodied within an instruction decode/execution unit, all operands for a given instruction have to be retrieved from a common bank selected by the prefix selector, specified within the prefix selector in an alternate embodiment. Using the segment selector as a bank selector for all operands of a given instruction is undesirable because it requires access to a control register to identify a bank, and restricts all instructions to have operands coming from just a single bank, leading to inefficient register allocation. Thus, if a common value has to be combined with other operands residing in multiple banks, copies of the common value have to be maintained, computed and updated in all banks, such that they can be combined with the other operands residing in the other banks, leading to inefficient register usage due to data duplication, and inefficient performance profile due to the duplication of work to compute the common value in all banks. It is to be appreciated that the preceding implementation has to be programmed similar to a clustered machine, with distinct register files represented by the different banks.
U.S. Pat. No. 5,822,778 to Dutton et al., entitled “Microprocessor and Method of Using a Segment Override Prefix Instruction Field to Expand the Register File”, which is incorporated by reference herein, discloses that the prefix and the bank select are decoded first, before the instruction is actually retrieved. Then the instruction word is combined, and an access performed. In comparison, the wide select can start the access early, and decode additional information in parallel with the access cycle.
U.S. Pat. No. 5,768,574 to Christie et al., entitled “Microprocessor Using an Instruction Field to Expand the Condition Flags and a Computer System Employing the Microprocessor”, which is incorporated by reference herein, discloses a microprocessor that is configured to detect the presence of segment override prefixes in instruction code sequences being executed in flat memory mode, and to use the prefix value or the value stored in the associated segment register to selectively enable condition flag modification for instructions. An instruction which modifies the condition flags and a branch instruction intended to branch based on the condition flags set by the instruction may be separated by numerous instructions which do not modify the condition flags. When the branch instruction is decoded, the condition flags it depends on may already be available. In another embodiment, the segment register override bytes are used to select between multiple sets of condition flags. Multiple conditions may be retained by the microprocessor for later examination. The conditions that a program utilizes multiple times may be maintained while other conditions may be generated and utilized.
U.S. Pat. No. 5,838,984 to Nguyen et al., entitled “Single-Instruction-Multiple-Data Processing Using Multiple Banks of Vector Registers”, which is incorporated by reference herein, discloses a digital signal parallel vector processor for multimedia applications. As disclosed therein, a single instruction multiple data processor uses several banks of vector registers. This processor uses a bank bit included in a control register to identify a primary bank, and a secondary alternate bank to be identified by a select set of instructions. This is undesirable because it requires the access to a control register to identify a bank, and restricts all operations to have operands coming from just a single bank, leading to inefficient register allocation. Thus, if a common value has to be combined with other operands residing in multiple banks, copies of the common value have to be maintained, computed and updated in all banks, such that they can be combined with the other operands residing in the other banks, leading to inefficient register usage due to data duplication, and inefficient performance profile due to the duplication of work to compute the common value in all banks. It is to be appreciated that the preceding implementation has to be programmed similar to a clustered machine, with distinct register files represented by the different banks.
U.S. Pat. No. 5,926,646 (hereinafter the “'646 patent”) to Pickett et al., entitled “Context-Dependent Memory-Mapped Registers for Transparent Expansion of a Register File”, which is incorporated by reference herein, discloses a context dependent memory mapped register accessing device for transparent expansion of a register file in a microprocessor of a computer system. Therein, in-core registers are made available as a memory-mapped address space. While the adding of additional registers in the core to be referenced by the processor is allowed, the use of memory mapping has several disadvantages. Specifically, the disadvantages relate to the fact that register names can only be properly resolved after the address generation phase, as a multitude of memory address forms can refer to a memory mapped register. This will increase the latency of access to these registers to almost the latency for first level cache access. In addition, a memory-mapped register can only be referenced for those instructions that have operand forms allowing memory accesses. This typically represents only a subset of operations, and often only a subset of operands therein. This limitation is particularly severe for RISC processors, which can only reference memory operands in load and store operations, imposing the additional cost of performing copies from the memory-mapped in-core registers to computationally useable operand registers.
In another disadvantageous aspect of the '646 patent, when addresses are generated before address generation from a subset of “preferred forms”, address aliasing can occur and lead to incorrect program execution. In yet another disadvantageous aspect of the '646 patent, when an address to such in-core register is added to a linked list, and accessed by a remote processor, this will lead to data coherence inconsistencies. Alternatively, costly methods for accessing such registers from SMP remote nodes have to be implemented and provided.
U.S. Pat. No. 6,154,832 to Maupin, entitled “Processor Employing Multiple Register Sets to Eliminate Interrupts”, which is incorporated by reference herein, discloses a processor which assigns a specified register set for a default task and other sets for different interrupt sources. While this extends the number of registers implemented in the processor, such an approach is not suitable for the extension of the register set useable by a single process or program.
U.S. Pat. No. 5,737,625 (hereinafter referred to as the “'625 patent”) to Jaggar, entitled “Selectable Processing Registers and Method”, which is incorporated by reference herein, discloses a high performance memory register selection apparatus which has a controller responding to a selection-word to control a circuit to select registers depending on the control field of a word and the prior register selection. This is limited in that only the architected set of prior art registers can be accessed at any one time, thus not making more than the number of prior art registers available at any one time.
In another disadvantageous aspect of the '625 patent, additional instructions are required in the instruction stream to update the control word. In typical implementations, these updates will have to be made context synchronizing, i.e., no operations before the update may have outstanding references, nor can any instruction occurring in the instruction stream be dispatched until the control register update has completed. In one non-synchronizing aspect of an implementation, multiple rename versions of the control register have to be maintained, disadvantageously leading to design complexity, and high area and power usage.
U.S. Pat. No. 5,386,563 to Thomas, entitled “Register Substitution During Exception Processing”, which is incorporated by reference herein, discloses a data processing system operable in either main or exception processing mode. In accordance with the invention, the CPU restores data stored in a saved processing status register, to another register upon leaving exception-processing mode. While this extends the number of registers implemented in the processor, this is not suitable for the extension of the register set useable by a single process or program.
Microcode used for implementing microprocessor ISAs using internal layering has used a variety of formats, using contiguous or non-contiguous fields. None of these were concerned with the maintenance of cross-generational compatibility or programming orthogonality. In general, microcode has different requirements, and methods from microcode are recognized to not be applicable to architected instruction sets by those skilled in the art due to issues related to the internal representation, requirements for compatibility, decoding of instructions and detection of data and structural hazards (which are not supported in the restricted microcode programming model), as well as the need to maintain compatibility across generations of a design.
Prior art instruction sets have offered the use of non-contiguous immediate constants, e.g., as disclosed by Moreno et al., in “An innovative low-power high-performance programmable signal processor for digital communications”, IBM Journal of Research and Development, Vol. 47, No. 2/3, 2003, which is incorporated by reference herein, to allow extended immediate specifications in bundle encodings, but do not address the encoding of non-contiguous fields in a fixed width instruction. The issues for immediate operand and similar fields are different because they do not require any early steering and access to determine dependences, access of register files, and so forth. In particular, this has also not required advanced decoding and register file access implementations. Thus, while constants have been encoded in non-contiguous ways in bundle instruction sets, the encoding of non-contiguous register file specifiers in fixed width instruction sets has not been addressed in this and other instruction sets.
In accordance with modern compilation techniques, register allocation is usually performed using graph coloring techniques, as first described by Chaitin, in “Register Allocation and Spilling via Graph Coloring”, ACM SIGPLAN Conference on Compiler Construction, June 1982, which is incorporated by reference herein. Briggs et al., in “Coloring Heuristics for Register Allocation”, ACM SIGPLAN Conference on Programming Language Design and Implementation, July 1989, which is incorporated by reference herein, disclose an improved framework for spill code handling. Briggs et al., in “Coloring Register Pairs”, ACM Letters on Programming Languages and Systems, Vol. 1, No. 1, March 1992, which is incorporated by reference herein, disclose a method for improved handling of register pairs in conjunction with Chaitin's method. Briggs et al., in “Rematerialization”, ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, June 1992, which is incorporated by reference herein, disclose a method for reducing the cost of spill code. Briggs et al., in “Improvements to Graph Coloring Register Allocation”, ACM Transactions on Programming Languages and Systems, Vol. 16, No. 3, May 1994, which is incorporated by reference herein, describe a framework for generalized register allocation with reduced spill code, register pairing, and coalescing methods.
Chow et al., in “Register Allocation by Priority-Based Coloring”, ACM SIGPLAN Conference on Compiler Construction, June 1984, which is incorporated by reference herein, disclose an alternate method of applying register allocation using graph coloring based on live range splitting. Vegdahl, in “Using Node Merging to Enhance Graph Coloring”, ACM SIGPLAN Conference on Programming Language Design and Implementation, May 1999, which is incorporated by reference herein, describes the use of node merging to enhance graph coloring for nodes which may be allocated to a same register. Park et al., in “Optimistic Register Coalescing”, ACM Transactions on Programming Languages and Systems, Vol. 26, No. 4, July 2004, which is incorporated by reference herein, disclose improved methods for optimistic register coalescing.
While the referenced work describes a general framework for performing register allocation using graph coloring, coalescing, spill code optimization, and paired register allocation, and so forth, none of the described prior works deals with code generation for an architecture with an extended register specification as outlined herein.
Chaitin et al., in “Register Allocation by Coloring”, IBM Research Report 8395, 1980, which is incorporated by reference herein, propose the introduction of additional graph nodes and edges to the interference graph in order to represent constraints. While this allows representing the constraints of the specification, the approach was proposed to incorporate constraints covering a small set of nodes. Adding a significant number of edges and nodes to represent architecture constraints represents a significant cost, as indicated by Chaitin et al., in “Register Allocation and Spilling via Graph Coloring”, which is incorporated by reference herein: “The register interference graph is a large and massive data structure, and it is important to represent it in a manner that uses as little storage as possible consistent with the ability to process it at high speed.”
Runeson et al., in “Generalizing Chaitin's Algorithm: Graph Coloring Register Allocation for Irregular Architectures”, Uppsala University Technical report 2002-021, Uppsala, Sweden, which is incorporated by reference herein, describe an extension to Chaitin's algorithm for irregular architectures. In accordance with Runeson et al., a colorability test called “<p,q> test” is implemented. However, while this test allows the representation of constraints in an irregular architecture, it is only an approximation of colorability. In addition, while this test allows for the representation of colorability for a wide range of architectures, the test is expensive to implement, resulting in slow compilation times.
Smith et al., in “A Generalized Algorithm for Graph Coloring Register Allocation”, ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2004, which is incorporated by reference herein, describe a similar test, with similar deficiencies.
Kong et al., in “Precise Register Allocation for Irregular Register Architectures”, 31st annual ACM/IEEE International Symposium on Microarchitecture, November 1998, which is incorporated by reference herein, describe an integer programming approach to register allocation. Using integer programming for register allocation gives good allocation results, but at the expense of runtime (i.e., long compilation times).
Turning to FIG. 1A, a register allocation method is indicated generally by the reference numeral 100. The method 100 of FIG. 1A is described in the above-referenced article by Chaitin entitled “Register allocation and spilling via graph coloring”.
In step 110, an interference graph is built. In step 120, the interference graph is simplified by applying a colorability test to each node and, if the node is determined to be colorable, pushing that colorable node onto the stack. The node is then removed from the graph. This step is repeated until no colorable nodes can be found. When no colorable nodes can be found and the graph is not empty, control transfers to step 130 (“spill”). A node is selected and removed from the graph for use as a memory operand. Spill code is inserted to ensure references to the spilled node can be properly executed. Control then returns to step 110. In step 140, the graph is colored by removing nodes from the stack in last-in, first-out order and allocating colors to nodes.
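The flow of steps 110 through 140 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the referenced article: the function name `chaitin_allocate`, the edge-list input, and the choice of the highest-degree node as spill victim are all hypothetical, and a spill is modeled simply by dropping the node before the graph is rebuilt rather than by inserting real spill code.

```python
def chaitin_allocate(nodes, edges, k):
    """Sketch of FIG. 1A. Returns (coloring, spilled) for k colors.

    nodes: iterable of sortable node names; edges: iterable of (a, b) pairs.
    Assumption: spilling is modeled by deleting the node from the graph.
    """
    live = set(nodes)
    spilled = []
    while True:
        # Step 110: build the interference graph over unspilled nodes.
        graph = {n: set() for n in live}
        for a, b in edges:
            if a in live and b in live:
                graph[a].add(b)
                graph[b].add(a)

        # Step 120: simplify. Push nodes with degree < k and remove them.
        work = {n: set(adj) for n, adj in graph.items()}
        stack = []
        while work:
            colorable = [n for n in sorted(work) if len(work[n]) < k]
            if not colorable:
                break
            n = colorable[0]
            stack.append(n)
            for m in work.pop(n):
                work[m].discard(n)

        if work:
            # Step 130: spill. Hypothetical heuristic: highest remaining degree.
            victim = max(sorted(work), key=lambda n: len(work[n]))
            spilled.append(victim)
            live.discard(victim)
            continue  # back to step 110

        # Step 140: color in last-in, first-out order.
        coloring = {}
        while stack:
            n = stack.pop()
            used = {coloring[m] for m in graph[n] if m in coloring}
            coloring[n] = next(c for c in range(k) if c not in used)
        return coloring, spilled
```

For a four-node cycle and k = 2, every node has degree 2, so the degree(node)<k test fails for all of them and this sketch spills one node even though the cycle is 2-colorable.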
Turning to FIG. 1B, an improved register allocation method is indicated generally by the reference numeral 140. The method 140 of FIG. 1B is described in the above-referenced article by Briggs entitled “Coloring Heuristics for Register Allocation”.
In step 150, an interference graph is built. In step 160, the interference graph is simplified by applying a colorability test to each node, and pushing a colorable node on the stack. The node is then removed from the graph. This step is repeated until no colorable nodes can be found. When no colorable node is found, a spill candidate is identified, and pushed on the stack. The node is then removed from the graph. Simplification then resumes. In step 170, the graph is colored by removing nodes from the stack in last-in, first-out order and allocating to each node a color that is not assigned to any node with which it interferes. When a node is found which cannot be colored, because its neighbors already use all k colors, it is left uncolored. After the coloring has been completed, if any nodes are uncolored, control transfers to step 180 (“spill”). Otherwise, the method terminates after step 170. In step 180, spill code is generated, and control transfers to step 150.
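The difference from FIG. 1A, namely that the spill decision is deferred until coloring actually fails, can be illustrated as follows. This is a hedged sketch only: `briggs_allocate`, the highest-degree spill-candidate heuristic, and the modeling of step 180 as deleting the uncolored nodes are assumptions made for illustration and are not taken from the referenced article.

```python
def briggs_allocate(nodes, edges, k):
    """Sketch of FIG. 1B. Returns (coloring, spilled) for k colors.

    Assumption: step 180 is modeled by deleting uncolored nodes and rebuilding.
    """
    live = set(nodes)
    spilled = []
    while True:
        # Step 150: build the interference graph over unspilled nodes.
        graph = {n: set() for n in live}
        for a, b in edges:
            if a in live and b in live:
                graph[a].add(b)
                graph[b].add(a)

        # Step 160: simplify. If nothing passes degree < k, push a spill
        # candidate anyway (hypothetical heuristic: highest degree).
        work = {n: set(adj) for n, adj in graph.items()}
        stack = []
        while work:
            colorable = [n for n in sorted(work) if len(work[n]) < k]
            if colorable:
                n = colorable[0]
            else:
                n = max(sorted(work), key=lambda m: len(work[m]))
            stack.append(n)
            for m in work.pop(n):
                work[m].discard(n)

        # Step 170: color in LIFO order; leave a node uncolored on failure.
        coloring, uncolored = {}, []
        while stack:
            n = stack.pop()
            used = {coloring[m] for m in graph[n] if m in coloring}
            free = [c for c in range(k) if c not in used]
            if free:
                coloring[n] = free[0]
            else:
                uncolored.append(n)
        if not uncolored:
            return coloring, spilled

        # Step 180: spill the uncolored nodes, then return to step 150.
        spilled.extend(uncolored)
        live -= set(uncolored)
```

On a four-node cycle with k = 2, no node passes the degree test, yet the optimistically pushed candidate turns out to be colorable because its two neighbors receive the same color, so nothing is spilled. A complete graph on four nodes with k = 3 still requires one spill.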
The method described by Chow et al. in the above-referenced article entitled “Register Allocation by Priority-Based Coloring” follows the basic algorithm described above, but uses live range splitting in lieu of spilling.
Referring now to step 120 of FIG. 1A, and step 160 of FIG. 1B, the central operation during the simplify step is the performance of the colorability test.
Turning to FIG. 2, a method illustrating the simplify steps 120 and 160 of FIGS. 1A and 1B, respectively, is indicated generally by the reference numeral 200.
In step 210 a node is selected. In step 220, the colorability test is performed. In accordance with Chaitin as described in “Register Allocation and Spilling via Graph Coloring” (and also used by Briggs et al., as described in the above-referenced article entitled “Coloring Heuristics for Register Allocation”, and Chow et al., in the above-referenced article entitled “Register Allocation by Priority-Based Coloring”), this test is of the form degree(node)<k. Disadvantageously, this test cannot determine colorability in an extended register specification as set forth herein.
As described by Runeson et al. in the above-referenced article “Generalizing Chaitin's Algorithm: Graph Coloring Register Allocation for Irregular Architectures”, a <p,q> test is performed. As described by Smith et al. in the above-referenced article “A Generalized Algorithm for Graph Coloring Register Allocation”, a similar test is performed. Disadvantageously, these tests are only an approximation and are excessively general, and hence expensive to implement.
In step 230, if the outcome of the test is positive (i.e., indicating that the node is colorable), control transfers to step 240. Otherwise, if the test is not successful, control transfers to step 250.
In step 240, a colorable node has been identified. The node is pushed on the stack, removed from the interference graph, the node counts are updated, and control passes to step 210 to select the next node.
In step 250, the node has been determined to not be colorable. If any potentially colorable nodes are left, control transfers to step 210 to select the next node. If no colorable nodes are left, control transfers to step 260.
In step 260, no colorable nodes are found, and a spill candidate is selected. If a spill candidate is identified, control passes to step 270. Otherwise, if no spill candidate can be found, and no node is colorable, the graph is empty, and the “simplify” phase has completed and the simplify method terminates.
In step 270, a spill candidate has been found. It is handled in accordance with one or more specific methods for handling a spill candidate: (1) in accordance with the above-referenced article by Chaitin, entitled “Register allocation and spilling via graph coloring”, the spill candidate is spilled immediately, and spill code is inserted; (2) in accordance with the above-referenced article by Briggs, entitled “Coloring Heuristics for Register Allocation”, the spill candidate is pushed on the stack, and spill code will be generated later; and (3) in accordance with the above-referenced article by Chow et al., entitled “Register Allocation by Priority-Based Coloring”, the live range will be split.
In step 280, the original node is removed from the interference graph, the node counts are updated, and control passes to step 210 to select the next node.
Thus, what is needed is an improved register allocation approach with an improved register colorability test.
In another aspect of allocating registers for the described extended register specification, spill code should be optimized.
In traditional register allocation, when a register requirement cannot be allocated to a register satisfying its constraints (i.e., it is not colorable), then some register is spilled to memory. In accordance with an implementation of an extended register specification, it can be preferable to spill a register into an alternate register class of the specification.
A brief description of the handling of intrinsics in accordance with the prior art will now be given.
Referring now to the use of intrinsics as a specification of operations to be executed by a program, the current state of the art is shown in FIG. 3. Turning to FIG. 3, a method for handling intrinsics is indicated generally by the reference numeral 300.
In step 310, a specific intrinsic is identified. In step 320, an intermediate language (IL) representation is generated from the program-specified intrinsic. In step 330, register allocation is performed in accordance with a known register allocation method. In step 340, an ISA specific encoding is performed, but excluding the use of a fixed-width instruction word extended register specification. The method terminates after step 340.
Turning to FIG. 4, another method for processing intrinsics is indicated generally by the reference numeral 400. The method 400 is based on an advanced use of polymorphic intrinsics (i.e., intrinsics which specify an operation where the specific instruction is one from a set of instructions, dependent on the operand data type) such as specified in the VMX application programming interface.
In step 410, an intrinsic is identified. In step 420, a test is performed identifying the intrinsic specification in the program representation to refer to one of a polymorphic and a non-polymorphic intrinsic. If the intrinsic is not polymorphic, control transfers to step 430. If the intrinsic is polymorphic, control transfers to step 440.
In step 430, the intrinsic is known to be not polymorphic. The intrinsic is directly mapped to its intermediate language (IL) representation, and processing continues with step 450.
In step 440, a polymorphic intrinsic has been encountered. In accordance with the polymorphic intrinsic specification, the intrinsic type is derived from the input data types at the high level language level (i.e., specified by the programmer using the high level language's data type system). A simple table lookup is made, and the IL representation of a specific intrinsic is generated based on the specification provided by the data type. Processing continues with step 450.
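The table lookup of steps 420 through 440 might look like the following sketch. The table contents, the function names, and the tuple used as an IL representation are hypothetical and chosen only for illustration. The sketch also assumes, as a simplification, that all operands of a polymorphic intrinsic share one data type.

```python
# Hypothetical table: (polymorphic intrinsic, operand type) -> instruction.
POLYMORPHIC_TABLE = {
    ("vec_add", "vector unsigned char"): "vaddubm",
    ("vec_add", "vector unsigned short"): "vadduhm",
    ("vec_add", "vector unsigned int"): "vadduwm",
    ("vec_add", "vector float"): "vaddfp",
}

# Hypothetical table of non-polymorphic intrinsics -> instruction.
DIRECT_TABLE = {
    "vec_vadduwm": "vadduwm",
    "vec_vaddfp": "vaddfp",
}


def is_polymorphic(name):
    """Step 420: does this intrinsic name select among several instructions?"""
    return any(key[0] == name for key in POLYMORPHIC_TABLE)


def intrinsic_to_il(name, operands):
    """Map an intrinsic call to an illustrative IL tuple (opcode, operand names).

    operands: list of (value_name, high_level_type_name) pairs.
    """
    if not is_polymorphic(name):
        # Step 430: direct mapping.
        opcode = DIRECT_TABLE[name]
    else:
        # Step 440: derive the instruction from the operand data type.
        operand_type = operands[0][1]
        if any(t != operand_type for _, t in operands):
            raise TypeError("operands of %s have mixed types" % name)
        opcode = POLYMORPHIC_TABLE[(name, operand_type)]
    return (opcode, tuple(value for value, _ in operands))
```

In this sketch the same source-level name `vec_add` yields different IL opcodes for `vector float` and `vector unsigned int` operands, while a non-polymorphic name bypasses the type lookup entirely.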
In step 450 register allocation is performed in accordance with a known register allocation method.
In step 460, an ISA specific encoding is performed, but excluding the use of a fixed-width instruction word extended register specification. The method terminates after step 460.
Referring now to FIGS. 5A-5C, there are shown instruction formats 500, 530, and 560, respectively, for the VMX instruction set extension to the PowerPC architecture. In accordance with the VMX instruction set specification, a VMX instruction includes a 6-bit primary opcode field in bit positions 0 to 5 (labeled as opcode OPCD field).
VMX instructions are encoded using one of 3 basic format types. In the first format type 500, shown in FIG. 5A, X-Form operations are identified by a primary opcode field with value decimal 31, and are used to implement load and store instructions, as well as other instructions used to support memory access, such as lvsl and lvsr instructions. In accordance with the X-Form format, there is provided a secondary (or extended) opcode field (labeled as extended opcode XO field) from bits 21 to 30. In addition, there are provided 3 register specifier fields: one for a VMX source operand (for store instructions) or target operand (for load or compute permute control word instructions) in register specifier field VT from bits 6 to 10, and two general purpose register specifier fields in bits 11-15 and bits 16-20.
In the second format type 530, shown in FIG. 5B, VA-Form operations are identified by a primary opcode with value decimal 4 and a subset of XO field bits, and are used to implement 4 operand VMX operations, such as permute, select and fused-multiply-add operations (and so forth), all having 3 vector register input operands and one vector register output operand, as well as some shift and other rotate by immediate operations, having one immediate and 2 vector register input operands, and one vector output operand. In accordance with this format, the instruction has an XO-field (labeled as extended opcode XO field) ranging from bits 26 to 31.
In the third format type 560, shown in FIG. 5C, VX-Form operations are identified by a primary opcode with value decimal 4 and a subset of XO field bits, and are used to implement 3 operand VMX operations, such as arithmetic and logical operations, compare, and so forth. The VX form has an extended opcode field (labeled as extended opcode XO field) ranging from bits 21 to 31. In addition, the VX format has a vector register target specifier field VT in bits 6-10, and two source operand specifiers in bits 11-15, and bits 16-20, which can be used to specify either a vector register input, or a signed or unsigned immediate constant operand, based on the particular format selected by the value of the XO field.
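The bit positions given for the X-Form and VX-Form layouts can be exercised with a small encoder and decoder. This is an illustrative sketch: the helper names and the field labels `A`, `B`, `RA`, and `RB` are hypothetical (the description above names only OPCD, VT, and XO), the XO value used in the usage note is arbitrary, and bit 0 is taken to be the most significant bit of the 32-bit instruction word, following PowerPC numbering.

```python
# Field layouts as (first_bit, last_bit), bit 0 being the most significant bit.
VX_FORM = {"OPCD": (0, 5), "VT": (6, 10), "A": (11, 15), "B": (16, 20), "XO": (21, 31)}
X_FORM = {"OPCD": (0, 5), "VT": (6, 10), "RA": (11, 15), "RB": (16, 20), "XO": (21, 30)}


def field(word, first, last):
    """Extract bits first..last (inclusive) from a 32-bit instruction word."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)


def encode(layout, **values):
    """Pack named field values into a 32-bit word; unnamed fields stay zero."""
    word = 0
    for name, value in values.items():
        first, last = layout[name]
        width = last - first + 1
        if not 0 <= value < (1 << width):
            raise ValueError("%s=%d does not fit in %d bits" % (name, value, width))
        word |= value << (31 - last)
    return word


def decode(layout, word):
    """Unpack every field of the layout from a 32-bit word."""
    return {name: field(word, first, last) for name, (first, last) in layout.items()}
```

For example, `encode(VX_FORM, OPCD=4, VT=1, A=2, B=3, XO=10)` produces the word `0x1022180A`, and `decode` recovers the same field values. Because each register specifier field is 5 bits wide, a value of 32 or more is rejected by `encode`, which reflects the limit of 32 registers addressable per field in these formats.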