In general, a digital computer is an apparatus for performing arithmetic and other computational operations on data stored in the form of binary information in the computer's memory. Frequently, a computer's memory is hierarchically organized; that is, a computer may have a large amount of primary or main memory for storing data and computer instructions, and possibly one or more levels of smaller and faster cache memory modules, for storing more frequently accessed data and instructions stored in main memory. The central processing unit (CPU) of a digital computer fetches computer instructions comprising a computer program from its memory, and performs operations on data in accordance with those instructions. Typically, the data being operated on by a CPU must first be copied or moved into the CPU's internal registers. Once a specified computation has been performed on data in the CPU's registers, the result may be retained in the register, or returned to cache or main memory.
Such a hierarchical memory system is characterized by having memory modules with slower access times near the bottom of the hierarchy, with progressively faster memories higher in the hierarchy. Frequently accessed memory locations are copied into faster cache memories higher in the hierarchy, so that less time is required to load the contents of those memory locations into the CPU's registers, when necessary to perform a computation specified by a program. Main memory, having relatively slow access times, is at the lower end of the memory hierarchy. If the amount of main memory available is insufficient, data may be stored in a mass-storage device, such as a hard disk drive or the like, and brought into main memory or elsewhere in the memory hierarchy when needed; in that case, the mass storage device could be viewed as the lowest level of the memory hierarchy. Internal registers of a CPU are at the top of the hierarchy, since once a piece of data is stored in a CPU register, the CPU may utilize that data without fetching it from memory. Ideally, all data accessed by a computer would be stored in CPU registers; however, the amount of CPU register storage is usually limited to less than 128 bytes or so, while the total amount of memory accessible to the CPU must be extremely large, on the order of many millions of bytes.
As a consequence of the limited number of CPU registers available, it is necessary that data to be manipulated by the CPU be stored lower in the memory hierarchy during some portions of program execution, and transferred into the CPU registers when an operation involving that data is to be performed. When a data word is stored in a CPU register, it is often the case that data already stored in that register must first be removed from the register and stored at some lower point in the memory hierarchy, such as in a cache, in main memory, or even on a mass storage device, as previously described.
Naturally, the speed of execution of a computer program depends on the time required to access the data being operated on. As the processing speeds of modern computers have increased dramatically, the time required to access the data on which a computer will operate is often the limiting factor in the overall processing speed of the system. Keeping frequently accessed data in higher-level cache memories reduces the amount of memory access time consumed during execution of a program, thereby enhancing execution efficiency. However, swapping data into and out of the CPU's internal registers is usually unavoidable, especially in long, complex programs, or where the number of values manipulated in the course of program execution exceeds the number of available internal CPU registers.
As a result of such considerations, the allocation of registers to particular values during program execution can have a significant impact upon the execution efficiency. If a frequently manipulated value is not stored in a CPU's register, the CPU must wait for access to a lower-level memory before operating on that value, and execution efficiency will suffer. On the other hand, if an infrequently-used value is stored in a register even during portions of program execution in which it is not referenced, this prevents the register from being allocated to a more-frequently referenced, or currently active value, also causing a decrease in execution efficiency.
There has been shown in the prior art a technique in which register allocation is treated as a graph-coloring problem. Each node in the graph represents a value that is stored in a CPU register, and two nodes of the graph are connected by an edge if the values "interfere" with each other. As is well known to those of ordinary skill in the art (see, e.g., U.S. Pat. No. 4,571,678 to Chaitin), two values are said to interfere with one another if they are different, and one is live at the definition of the other. For the purposes of this description, a value is deemed to be "live" if: (1) it has been computed or defined, and (2) it will subsequently be used in a computation before being re-computed or redefined. The period of time during program execution between when a value becomes live and when it is no longer live is called the "live range" of that value.
As one skilled in the field of computer science will appreciate, the coloring of a graph is an assignment of a color to each of the nodes in the graph such that if two nodes are adjacent (i.e., connected by an edge of the graph), then they are assigned different colors. A coloring of a given graph is said to be an N-coloring if it does not use more than N different colors. The chromatic number of a graph is defined to be the minimal number of colors in any of its colorings, that is, the least N for which the graph may be N-colored.
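The definitions above can be illustrated with a minimal sketch. The function names and the representation of a graph as a list of node names plus a list of edge pairs are assumptions made for illustration only; the brute-force search for the chromatic number tries every assignment and is practical only for very small graphs.

```python
from itertools import product


def is_valid_coloring(edges, coloring):
    """Return True if no edge connects two nodes that share a color."""
    return all(coloring[u] != coloring[v] for u, v in edges)


def chromatic_number(nodes, edges):
    """Find the least N for which an N-coloring exists, by exhaustive search."""
    for n in range(1, len(nodes) + 1):
        # Try every assignment of n colors to the nodes.
        for assignment in product(range(n), repeat=len(nodes)):
            if is_valid_coloring(edges, dict(zip(nodes, assignment))):
                return n
    return 0  # a graph with no nodes needs no colors


# Hypothetical example: a triangle a-b-c with an extra node d attached to c.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
coloring = {"a": 0, "b": 1, "c": 2, "d": 0}

print(is_valid_coloring(edges, coloring))  # True: a 3-coloring
print(chromatic_number(nodes, edges))      # 3: the triangle rules out 2 colors
```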
According to the present invention, all of the live ranges of a program are represented by nodes in a graph, called an interference graph, in which each edge represents an interference between two live ranges. In this way, if two live ranges exist at a single point in the program, there is an edge between their nodes in the interference graph. If a node has N neighbors, that node is said to be of degree N. Register allocation schemes in the prior art have attempted to color the interference graph with K colors, where K is the number of CPU registers available. If a K-coloring is found, each register is assigned a color, and live range nodes of that color are stored in the corresponding register during program execution.
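As a rough sketch of how such an interference graph might be built, the following assumes, purely for illustration, that each live range can be modeled as a single half-open interval of instruction positions, so that two live ranges interfere when their intervals overlap. Real live ranges need not be contiguous, and the names used here are hypothetical.

```python
from itertools import combinations


def build_interference_graph(live_ranges):
    """Map each live range name to the set of live ranges it interferes with.

    live_ranges: dict of name -> (start, end), a half-open interval [start, end).
    """
    graph = {name: set() for name in live_ranges}
    for a, b in combinations(live_ranges, 2):
        a_start, a_end = live_ranges[a]
        b_start, b_end = live_ranges[b]
        # Two half-open intervals overlap if each starts before the other ends.
        if a_start < b_end and b_start < a_end:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical live ranges for four values.
live_ranges = {"x": (0, 4), "y": (2, 6), "z": (5, 8), "w": (3, 7)}
graph = build_interference_graph(live_ranges)

# The degree of a node is its number of neighbors.
degrees = {name: len(neighbors) for name, neighbors in graph.items()}
print(degrees)  # {'x': 2, 'y': 3, 'z': 2, 'w': 3}
```

With K = 3 registers, every node with degree below 3 (here x and z) is trivially colorable regardless of how its neighbors are colored.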
If a K-coloring does not exist for a given interference graph, code must be added to the program to "spill" one or more live ranges; that is, provisions must be made for certain values to be removed from registers during portions of program execution, and reloaded when such values are again referenced. This has the effect of eliminating the spilled live range and creating a new, small live range around each individual use or definition of that value within the program. This transforms the interference graph into one having additional nodes, but possibly fewer edges, and hopefully fewer interferences. Then the register allocation scheme will attempt to K-color the new interference graph. This iterative process of spilling and attempting to K-color the graph continues until a K-colorable graph is found.
It is widely known in the field of computer science that the problem of obtaining a minimal graph coloring is among a class of so-called nondeterministic polynomial-time complete (NP-complete) problems, which can take an amount of time to solve that grows exponentially with the size of the graph. It is widely believed that the problems in the NP-complete class are incapable of being solved in time proportional to a polynomial function of the size of the problem; indeed, no polynomial-time bounded solution to an NP-complete problem has yet been found. From the standpoint of register allocation, such exponential performance is clearly undesirable, since it would lead to impractical times for allocation, and thus for the whole compilation process.
It has been proposed in the prior art, however, that the NP-completeness of graph coloring is not an insurmountable obstacle to a register allocation scheme based on graph coloring. In the prior art, certain heuristic approaches have led to graph-coloring-type register allocation schemes which take time linear in the size of the interference graph. One such scheme for coloring an interference graph is based on the principle that in order to obtain a K-coloring of graph G, if a node N has fewer than K neighbors, then no matter how the neighbors are colored there is necessarily one of the K colors left for node N; thus node N can be thrown out of the graph G. The problem of obtaining a K-coloring of G is therefore recursively reduced to the problem of obtaining a K-coloring of a graph G', where G' has one fewer node, and probably several fewer edges, than graph G.
In practice, this method of the prior art has three phases. In Phase One, an interference graph is constructed, with one node for each live range and one edge for each interference. In Phase Two, the graph is simplified; this is accomplished by removing, one at a time, each node N with degree less than K, the number of CPU registers, along with all of its edges, and placing the node N in a stack. If the allocator reaches a state where all remaining nodes have degree greater than or equal to K, it must select a node to spill. Using some metric, it chooses a node to spill, removes it from the graph, records that this node will be spilled, then continues with Phase Two. For the node that is to be spilled, the original program must be modified to include program steps, called spill code, which instruct the computer to store the spilled value to memory after definition, and restore the value to a register before its subsequent use in the program. Once the allocator has modified the program in this manner, i.e., by inserting spill code, it goes back to Phase One, builds the interference graph for the modified program, and attempts to find a K-coloring for this new graph. When the allocator has modified the program enough so that it finds a K-coloring, it proceeds to Phase Three. In Phase Three of the prior art scheme, colors are assigned to the nodes in the stack.
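The simplification performed in Phase Two might be sketched as follows. This is an illustrative sketch, not the prior art implementation: the graph is assumed to be a dictionary mapping each node to the set of its neighbors, the function name is hypothetical, and the spill metric used here (highest current degree) is only a placeholder for whatever metric an allocator actually applies.

```python
def simplify(graph, k):
    """Remove nodes of degree < k onto a stack; mark a node for spilling when stuck.

    Returns (stack, spilled). The input graph is not modified.
    """
    work = {node: set(neighbors) for node, neighbors in graph.items()}
    stack, spilled = [], []
    while work:
        low_degree = [node for node in work if len(work[node]) < k]
        if low_degree:
            node = min(low_degree)  # deterministic tie-break by name
            stack.append(node)
        else:
            # Every remaining node has degree >= k: pick a spill candidate.
            # Placeholder metric (an assumption): the highest-degree node.
            node = max(work, key=lambda n: (len(work[n]), n))
            spilled.append(node)
        # Remove the node together with all of its edges.
        for neighbor in work.pop(node):
            work[neighbor].discard(node)
    return stack, spilled


# Hypothetical interference graph with four live ranges.
graph = {
    "x": {"y", "w"},
    "y": {"x", "z", "w"},
    "z": {"y", "w"},
    "w": {"x", "y", "z"},
}

print(simplify(graph, 3))  # (['x', 'w', 'y', 'z'], [])
print(simplify(graph, 2))  # (['x', 'w', 'z'], ['y'])
```

With three registers the graph simplifies completely; with two, no node has degree below 2 at the outset, so one node must be marked for spilling before simplification can proceed.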
The prior art scheme is an iterative scheme in the sense that the entire process of building the interference graph, simplifying it, and inserting the spill code is repeated until a coloring can be achieved for the number of CPU registers. In Phase Two, the register allocator must decide which of the nodes with degree greater than or equal to the number of CPU registers to spill. In the prior art, it is suggested that the node with the lowest ratio of spill cost to degree should be spilled. For a live range, the spill cost may be defined as the number of additional cycles that would be required to save and restore the live range. Alternatively, the spill cost can be estimated to be the number of loads and stores that would have to be inserted in the program, weighted by the loop nesting depth of each insertion point. The spill cost may be precomputed for each node, such that when the register allocator reaches the point where it must choose a node to spill, it divides the precomputed spill cost by the node's current degree.
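The cost-to-degree heuristic described above can be sketched as follows. The text does not state how insertions are weighted by loop nesting depth; the factor of 10 per nesting level used here is an assumption chosen for illustration, as are the function names and the example data.

```python
def estimate_spill_cost(insertion_depths, weight_base=10):
    """Estimate spill cost from the loop depth of each inserted load or store.

    insertion_depths: one loop nesting depth per load/store to be inserted.
    weight_base: assumed weight per nesting level (not specified in the text).
    """
    return sum(weight_base ** depth for depth in insertion_depths)


def choose_spill(graph, spill_cost):
    """Pick the node with the lowest ratio of spill cost to current degree."""
    candidates = [node for node in graph if graph[node]]  # skip degree-0 nodes
    return min(candidates, key=lambda n: (spill_cost[n] / len(graph[n]), n))


# Hypothetical graph in which every node has degree >= 2 (so K = 2 is stuck).
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
# Precomputed costs: "b" is referenced twice inside a doubly nested loop.
spill_cost = {
    "a": estimate_spill_cost([0, 0, 1]),  # 1 + 1 + 10 = 12
    "b": estimate_spill_cost([2, 2]),     # 100 + 100 = 200
    "c": estimate_spill_cost([0, 0]),     # 1 + 1 = 2
    "d": estimate_spill_cost([1]),        # 10
}

print(choose_spill(graph, spill_cost))  # c: ratio 2/3 is the lowest
```

Dividing by degree favors spilling nodes whose removal relieves the most interferences per unit of added memory traffic.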
Once the necessary spill code has been inserted in the program, the actual coloring occurs in the third phase. In Phase Three, the register allocator removes a node from the top of the stack created in Phase Two and re-inserts it in the graph, along with all of its edges; this node is then assigned a color different from that of each of its neighbors. It has been shown in the prior art that this coloring process must succeed, given the work done in Phase Two. However, this prior art scheme is not guaranteed to find the minimal coloring for a given interference graph.
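A minimal sketch of the color assignment in Phase Three is given below. It assumes the graph is a dictionary mapping each node to the set of its neighbors, and that the stack lists nodes in the order in which simplification removed them, each having had fewer than K neighbors at the time of removal; the function name and example data are hypothetical.

```python
def select(graph, stack, k):
    """Pop nodes off the stack and give each a color unused by its neighbors.

    Colors are the integers 0..k-1, standing in for the k CPU registers.
    """
    coloring = {}
    for node in reversed(stack):  # the last node pushed is colored first
        # Only neighbors already re-inserted (colored) constrain the choice.
        used = {coloring[n] for n in graph[node] if n in coloring}
        free = [color for color in range(k) if color not in used]
        if not free:
            raise ValueError(f"no color left for {node!r}")
        coloring[node] = free[0]
    return coloring


# Hypothetical interference graph and a removal order valid for k = 3.
graph = {
    "x": {"y", "w"},
    "y": {"x", "z", "w"},
    "z": {"y", "w"},
    "w": {"x", "y", "z"},
}
stack = ["x", "w", "y", "z"]

print(select(graph, stack, 3))  # {'z': 0, 'y': 1, 'w': 2, 'x': 0}
```

Because each node had fewer than K neighbors remaining when it was removed, at most K-1 colors are in use among its neighbors when it is re-inserted, which is why the error branch is never reached for a stack produced in that way.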
It is accordingly a feature of the present invention to provide a technique for allocation of CPU registers during program execution which improves the execution efficiency of the program; such improvement arises from the reduction of instances in which a value stored low in the memory hierarchy must be fetched from its place in memory and loaded into a CPU register. Such improvement also results in a minimization or reduction of the number of times that a given value must be removed from a CPU register in order to make room for another value, and later restored in a CPU register.
It is another feature of the present invention to provide a technique for register allocation which itself does not consume an impractically large amount of pre-execution overhead or processing time. In particular, it is an object of the present invention to provide a register allocator whose execution time is asymptotically bounded by a function linear in the size of a program's interference graph.
A further feature of the present invention is to provide an improved register allocation scheme which increases the number of graphs for which the allocator finds a coloring that will fit in the number of registers provided by the machine.
Still another feature of the present invention is to provide a register allocator which performs effectively on graphs arising from real programs, as opposed to arbitrary or randomly generated graphs.
Yet another feature of the present invention is to provide a register allocator which makes decisions to spill live ranges based on non-arbitrary criteria.