The central processing unit ("CPU") of a computer fetches program instructions and data from the computer system's main memory and performs logical or mathematical operations on the data as specified by the instructions. The CPU accesses the data to be operated upon ("source operands") by importing a copy of the data into its own internal memory, called CPU registers. The results of the operations performed by the CPU (e.g. addition, subtraction, etc.) on the source operands are stored in other CPU registers. Each CPU operation requires at least one, and usually two, source operands, plus a result or destination operand. The CPU spends a predominant amount of its time fetching and storing these operands. Because computer system performance is largely measured by program execution speed, reducing CPU operand access time (i.e. the time spent fetching and storing operands) is critical to improving the system's performance.
An accepted approach for reducing the CPU operand access time is to provide a register file on board the CPU, that is, a relatively small block of memory that provides physical storage for the CPU registers. CPU access to operands stored in the register file is inherently faster than access to operands stored in main memory. The register file is small enough to reside on the same integrated circuit (chip) as the CPU. Because signal path delays are shorter between circuits on the same chip than between circuits on different chips, operands can be accessed faster. The address word for the register file is smaller than the address word for main memory, and therefore the decode time for a register file address should be shorter than the decode time for a main memory address. Additionally, the smaller number of devices used to implement the register file, compared to main memory, reduces the associated parasitic capacitance, thus further reducing data path delay.
The purpose of a register file is to provide fast CPU access to an optimal number of operands, thus minimizing the number of times the CPU must fetch and store operands from and to main memory and thereby enhancing CPU execution speed.
There are two architectural constraints to consider when designing a register file. First, the number of CPU registers is constrained by the number of bits conveniently available to the CPU for addressing the registers. The address bits are included in the CPU instruction word, which is limited in length by the number of bits comprising the data path of the computer. Typically, five bits are set aside for addressing a total of thirty-two CPU registers.
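The relationship between address bits and register count described above can be illustrated with a minimal Python sketch. The five-bit field width comes from the text; the positions of the three register fields within a thirty-two bit instruction word (RD_SHIFT, RS1_SHIFT, RS2_SHIFT) and the function names are hypothetical, chosen only for illustration.

```python
REG_FIELD_BITS = 5                        # five address bits per register field (from the text)
REG_MASK = (1 << REG_FIELD_BITS) - 1      # 0b11111, so register numbers run from 0 to 31
NUM_REGISTERS = 1 << REG_FIELD_BITS       # 2**5 addressable registers

# Hypothetical field positions inside a 32-bit instruction word (assumption).
RD_SHIFT = 25    # destination operand register
RS1_SHIFT = 14   # first source operand register
RS2_SHIFT = 0    # second source operand register


def encode_registers(rd, rs1, rs2):
    """Pack three register numbers into an otherwise empty instruction word."""
    for reg in (rd, rs1, rs2):
        if not 0 <= reg <= REG_MASK:
            raise ValueError("register number does not fit in five bits")
    return (rd << RD_SHIFT) | (rs1 << RS1_SHIFT) | (rs2 << RS2_SHIFT)


def decode_registers(word):
    """Extract the (rd, rs1, rs2) register numbers from an instruction word."""
    return (
        (word >> RD_SHIFT) & REG_MASK,
        (word >> RS1_SHIFT) & REG_MASK,
        (word >> RS2_SHIFT) & REG_MASK,
    )
```

Under these assumptions, two source fields and one destination field consume fifteen bits of the instruction word, which suggests why widening each field to address more registers is costly.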
The second constraint, which influences the organization of the register file, arises from the inherent characteristics of the computer programs which run on the computer system. Programs generally comprise a plurality of separate procedures which are called and executed multiple times during program execution. Each procedure may have its own set of operands necessary for its execution. Some of the operands are local; they are not shared among the other procedures in the program. Other operands are global; they are shared among all of the procedures in the program. Thus, if the register file were limited to just thirty-two memory locations to provide storage for the thirty-two registers, a new set of operands would have to be imported from main memory each time a new procedure was executed. The operands required for the previous procedure would be overwritten and would no longer be available the next time that procedure was called for execution.
The typical register file is a memory array, its physical storage locations organized into fixed size windows (i.e. blocks of memory). Each window represents a unique mapping of the thirty-two CPU registers onto an equal number of physical storage locations in the memory array. Each window typically contains twenty-four memory locations which are all allocated to provide physical storage for the local registers. Local registers are used to store local operands. A separate set of eight memory locations is allocated to provide storage for global registers, which are used to store global operands.
Thus, for each window the twenty-four window registers are assigned to a different block of memory. In other words, each time a new procedure is called by the program under execution, a new window of memory locations is assigned to the same registers. The previous procedure's operands are saved in one window of memory, while the operands unique to the second procedure reside in a different block or window of memory. Therefore, though the procedures use the same registers, they each use those registers with a different set of memory locations assigned to provide storage for the registers. In this way, the operands unique to one procedure may be stored and left uncompromised while another procedure is using the same registers to store its unique operands. As a result, the CPU does not have to re-import operands from external memory every time it jumps from one procedure to another.
The number of windows in a register file is determined by the level of nesting which can occur during program execution. Nesting occurs when a procedure A, while being executed, calls for the execution of a procedure B, which might then call for execution of a procedure C, and so forth. Once the last procedure called has been completed (i.e. procedure C), the CPU returns to and completes the execution of calling procedure B. Upon completing procedure B, the CPU returns and completes execution of calling procedure A.
Professors Hennessy and Patterson in "Computer Architecture: A Quantitative Approach," Morgan Kaufmann Publishers, Inc. (1990) at pages 450-451, provide illustrations depicting program nesting in the context of having to fetch operands from main memory when the register file is full or empty. Referring to FIG. 1, a diagram representing the nesting constraint is shown. The x axis is representative of time, measured in procedure calls or returns; the y axis is representative of the depth of nesting of procedure calls. Each call moves the nesting down the y axis, and each return moves the nesting up the y axis. The boxes show main memory being accessed either when the register file is full and there is a procedure call (window overflow) or when the register file is empty and there is a return (window underflow). FIG. 1 shows eight window overflows and two window underflows during a particular section of a hypothetical program execution. Over the life of any program, the number of overflows and the number of underflows should equal one another.
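The overflow and underflow behavior described above can be sketched in Python. This is a simplified model under stated assumptions, not the mechanism of any particular CPU: a trace is a string of "C" (call) and "R" (return) characters, the register file holds n_windows windows with no overlap, an overflow spills exactly one window to main memory, and an underflow restores exactly one. The function name count_window_traps is hypothetical.

```python
def count_window_traps(trace, n_windows):
    """Count (overflows, underflows) for a call/return trace.

    Assumptions: execution starts in one procedure occupying one window;
    'C' is a procedure call and 'R' is a return.
    """
    resident = 1      # windows currently held in the register file
    spilled = 0       # windows saved to main memory
    overflows = 0
    underflows = 0
    for event in trace:
        if event == "C":
            if resident == n_windows:
                # Register file full: spill the oldest window to main memory.
                overflows += 1
                spilled += 1
            else:
                resident += 1
        elif event == "R":
            resident -= 1  # the returning procedure's window is released
            if resident == 0:
                if spilled == 0:
                    raise ValueError("return from the outermost procedure")
                # Register file empty: restore the caller's window.
                underflows += 1
                spilled -= 1
                resident = 1
        else:
            raise ValueError("trace may contain only 'C' and 'R'")
    return overflows, underflows
```

For example, ten nested calls followed by ten returns with eight windows gives three overflows and three underflows in this model, whereas a trace that alternates calls and returns never leaves the register file and gives none.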
FIG. 2 shows the shape of a curve representative of the number of register windows versus the overflow rate for different programs in C, LISP and Smalltalk. The knee of the curve occurs between six and eight windows. While six to eight windows appear to be optimal for most programs, there may be specific program patterns of calls and returns that are quite different from those shown in FIG. 2. For example, in the worst case there might be hundreds of calls followed by hundreds of returns, and the register file would then need many windows to accommodate all of the levels of nesting.
An example of a typical register file is the one employed in the CY7C601 and CY7C611 products sold by Ross Technology, Austin, Tex. The register file on these chips has a memory array of one-hundred thirty-six memory locations, each thirty-two bits long, allocated such that one-hundred twenty-eight memory locations provide storage for the twenty-four local registers and a set of eight memory locations provides storage for the eight global registers. The one-hundred twenty-eight local memory locations are mapped into eight windows of twenty-four memory locations each. Each window represents the same twenty-four local registers, but each window maps those twenty-four registers onto a different set of twenty-four memory locations. Only one window of memory locations is active at a time (the window visible to the programmer), and it is identified by a current window pointer (CWP), which is a three-bit word stored in the CPU's state register. At any given time, a program can address thirty-two active registers: twenty-four local registers and eight global registers. By convention the local registers are classified into three groups: eight "in" registers, eight "local" registers, and eight "out" registers. The operands stored in the "in" and "out" (i.e. "shared") registers are not purely local, but can be passed from and to adjacent windows to enable other procedures to utilize the operands. The "in" and "out" registers are shared only with adjacent windows; therefore, they are not global. The "local" registers are never shared; they are purely local. FIG. 3 is a table of the register naming convention for any window currently pointed to in the CY7C601 and CY7C611 implementations.
In the CY7C601 and CY7C611 the CWP points to a stack of one-hundred twenty-eight thirty-two-bit memory locations. The register file address is an eight-bit word which is decoded to select and access one of the register file's one-hundred thirty-six memory locations. The CWP comprises the three most significant bits of the register file address. Thus, an increment (or decrement) of the current window pointer offsets the decoded value of the register file address by sixteen. Twenty-four memory locations are accessed for a single CWP value, however, so each window overlaps its neighbor by eight memory locations (twenty-four less sixteen). This overlap in window memory locations creates an effective overlap of window registers which is used to pass parameters from one window to either of its two adjacent windows.
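The window arithmetic above can be sketched in Python as follows. The stride of sixteen, the window size of twenty-four, and the totals of one-hundred twenty-eight and eight locations come from the text. The ordering of registers within a window ("out" at offsets 0 to 7, "local" at 8 to 15, "in" at 16 to 23), the wrap-around between the last and first windows, the placement of the global registers after the windowed locations, and the function names are assumptions made for illustration; the sketch models only the offset-by-sixteen behavior and not the actual address decoding of the CY7C601 or CY7C611.

```python
NUM_WINDOWS = 8          # selected by the three-bit CWP
WINDOW_STRIDE = 16       # address offset per CWP increment
WINDOW_SIZE = 24         # locations visible through one window
NUM_GLOBALS = 8
WINDOWED_LOCATIONS = NUM_WINDOWS * WINDOW_STRIDE   # 128 locations


def window_location(cwp, offset):
    """Map a window-relative offset (0..23) to a physical location (0..127).

    Assumed layout: offsets 0-7 "out", 8-15 "local", 16-23 "in".
    """
    if not 0 <= cwp < NUM_WINDOWS:
        raise ValueError("CWP must fit in three bits")
    if not 0 <= offset < WINDOW_SIZE:
        raise ValueError("offset must be in the range 0..23")
    # Wrap around so that the last window overlaps the first (assumption).
    return (cwp * WINDOW_STRIDE + offset) % WINDOWED_LOCATIONS


def global_location(index):
    """Map a global register (0..7) to a location after the windows (assumption)."""
    if not 0 <= index < NUM_GLOBALS:
        raise ValueError("global register index must be in the range 0..7")
    return WINDOWED_LOCATIONS + index
```

Under these assumptions, window_location(0, 16) and window_location(1, 0) name the same physical location, which is how an "in" register of one window and an "out" register of the adjacent window can hold a shared parameter without any copying.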
There are drawbacks to the typical register file implementations used today. First, there is a time delay associated with decoding the address bits which point to a particular memory location (the register file address) in the register file's memory array. Second, there is a time delay (access time) associated with either retrieving an operand stored in the register file memory array or storing a new operand in the register file memory array. Third, the current implementations do not adapt well to the parallel launching of more than one instruction. Today's CPU architectures can launch and execute more than one instruction per clock cycle, requiring simultaneous access to multiple operands. The conventional implementations can provide simultaneous multiple operand access, but only at the cost of increased circuit size in the register file memory array, resulting in a much greater operand access time.