1. Field of the Invention
The present invention relates to the field of computer systems. More specifically, the present invention relates to processors of computer systems, their instruction set and associated register architecture.
2. Background Information
The performance of a processor is directly tied to its Instruction Set Architecture (ISA), which in turn is significantly dependent on the associated register file architecture, since execution is carried out by performing operations defined by the instructions, upon data which is typically held in the register file. Thus, if a particular register file architecture has inherent limitation(s), a processor implementation of the associated ISA would have difficulty in obtaining the desired performance.
Historically, when integrated circuit technology was still in its infancy, the earliest "register file" architectures were all centered around a single register, also known as the accumulator. A particular example of an accumulator-based architecture is the Motorola 68xx. Typically, under these architectures, almost every operation defined by the instruction set would use the accumulator as both the operand source (hereinafter simply source) and the destination for the result of the operation (hereinafter simply destination), thus creating significant data flow congestion. These architectures offered the advantage of compact instruction encoding, but the constraint of a single register made it virtually impossible to offer a high performance implementation or take advantage of advances in large scale integration.
Later architectures tend to offer linearly addressed register files having multiple registers. Some architectures would offer multiple linearly addressed register files, one for integer operations, and another for floating point operations. Additionally, a small number of control registers would also be included. These control registers are not used to store data variables. Instead, they are used to store status information about the processor to facilitate control of the overall operation of the processor.
The number of registers offered in a linearly addressed register file varies from architecture to architecture, but typically 32 or fewer integer registers are offered, and optionally another 32 or fewer floating point registers may be offered. These registers are usually numbered R0 through R31 and are directly addressed (i.e. physically addressed). Examples of such linearly addressed register file architectures include MIPS®, Alpha® and PowerPC™¹.
¹ MIPS and Alpha are registered trademarks of MIPS Computer Inc. and Digital Equipment Corporation respectively, whereas PowerPC is a trademark of International Business Machines Corporation.
All three architectures define 32 registers each in separate integer and floating point register files. The width of a single datum is variable, but an instruction can specify up to two of the registers as sources, and a third register as the destination. Each architecture also offers a small set of control registers that can be manipulated via special instructions that require privileges to execute, or are otherwise outside the scope of normal operation of the processor.
The availability of 32 registers significantly reduces the data flow congestion into and out of the register files. However, as processor operations become deeply pipelined, and superscalar processors become the norm, these 32-register register file architectures again begin to be stressed to their inherent limitations. A typical instruction in a RISC microprocessor will use three registers in its execution, two for the sources and one for the destination. Thus, a four-scalar microprocessor can require the processing of 12 operands in parallel to execute the four instructions.
A pipelined microprocessor attempts to improve system performance by executing several instructions in parallel. The four phases of execution, i.e. fetch, decode, execute, and writeback, are arbitrarily subdivided into a number of pipeline stages that operate concurrently. If the execution of an instruction can be broken down into n equal stages, the clock cycle of the system could also be divided by n, thereby improving the maximum possible system throughput by n times. Thus, high performance microprocessors tend to be heavily pipelined to achieve maximum system performance. However, as the pipeline becomes deeper, more data must be operated on in parallel. The four-scalar microprocessor described above would require the coordination of up to 36 operands if three pipe stages were required to encompass the decoding of source operands to the writing back of the result data. For an eight-scalar microprocessor coordination of 72 operands could be required. These register requirements are more than the 32-register register file architectures can meet.
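The operand counts quoted above follow from a simple product of issue width, operands per instruction, and the number of pipe stages spanning operand decode to writeback. The following is a minimal sketch of that arithmetic only; the function name `operands_in_flight` and its parameters are illustrative and do not appear in the original text.

```python
def operands_in_flight(issue_width, pipe_stages, operands_per_instruction=3):
    """Upper bound on register operands that must be coordinated at once.

    Assumes every instruction names two sources and one destination
    (three operands) and that every pipe stage holds a full group of
    issue_width instructions.
    """
    return issue_width * operands_per_instruction * pipe_stages


# Four-scalar machine, single stage: 4 instructions x 3 operands.
four_scalar_one_stage = operands_in_flight(4, 1)

# Four-scalar and eight-scalar machines with three pipe stages
# between source decode and result writeback.
four_scalar_three_stages = operands_in_flight(4, 3)
eight_scalar_three_stages = operands_in_flight(8, 3)
```

Under these assumptions the three values are 12, 36 and 72 respectively, which matches the figures given in the text and exceeds the 32 registers of a conventional register file in the latter two cases.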
As a result, most super-scalar and deeply pipelined microprocessors adopt highly complex schemes to handle and process multiple values of the same register location simultaneously. However, the inherent limitations of these conventional linearly addressed 32-register register file architectures will cause them to eventually suffer the same congestion problems faced by the earlier accumulator-based architectures.
In addition to the basic data traffic flow problem, some architectures have adopted novel approaches to solve a problem commonly faced in integer operations. It is standard programming practice to subdivide the software problem into basic blocks called functions. A program can define a set of functions to address the individual portions of the overall problem, and can call upon these functions in the appropriate order to solve the problem in a "divide and conquer" manner. To efficiently use these functions, the program invoking the function must pass input data to the function, and must receive return data from the function. Thus, the need for a message-passing construct is implicit.
The SPARC®² architecture addresses this issue by providing "register windows". The register file is segregated into groups of eight registers. One of these groups is designated as the global registers, while the other groups are "windowed" registers. At any given time, an instruction has access to the global group and three groups of "windowed" registers, for a total of 32 registers. The global registers are always visible to software, while the other three groups can change as a function of the value held in a variable called the "current_window_pointer" (cwp). The three groups of "windowed" registers are designated as Out's, Locals, and In's. Efficient parameter passing to/from function calls is implemented by defining the register windows to overlap. The group of registers designated as In's for a cwp value of two would correspond to the group of registers that would be designated as the Out's for a cwp value of three. Likewise, the group of registers that would be designated as the Out's for a cwp value of two would also correspond to the In's for a cwp value of one.
² SPARC is a registered trademark of SPARC International.
Thus, each In or Out register must be capable of recognizing two addresses for itself, depending upon the value of the cwp. Since the globals are always available, they are independent of the cwp. The Local registers change with the cwp, but they do not overlap between windows. The cwp can be incremented or decremented by user software, but an arbitrary window of registers can only be selected by supervisor software. The dual addressing requirement for the In/Out register group makes cwp changes difficult to implement efficiently in hardware, but the capability of passing parameters to/from function calls reduces the number of memory references that are required, and thus improves system performance. However, register windows do not solve the register pressure issue that arises from having only 32 available registers. Hence, highly superscalar processors still face the same challenges in optimizing system performance.
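The overlapping window scheme described above can be illustrated by a mapping from a (cwp, register number) pair to a physical register. The following is a sketch under stated assumptions, not a description of any particular SPARC implementation: the window count `NUM_WINDOWS`, the physical layout, and the function name `physical_index` are hypothetical. The register numbering (0 to 7 globals, 8 to 15 Out's, 16 to 23 Locals, 24 to 31 In's) follows the usual SPARC convention.

```python
NUM_WINDOWS = 8   # assumed window count, chosen only for illustration
GROUP_SIZE = 8    # registers per group, as stated in the text


def physical_index(cwp, reg):
    """Map a visible register number (0..31) to a physical register.

    Assumed layout: physical registers 0..7 hold the globals; each
    window then owns 16 registers (8 Out's followed by 8 Locals), and
    the In's of window cwp alias the Out's of window cwp + 1.
    """
    if not 0 <= reg < 4 * GROUP_SIZE:
        raise ValueError("register number must be in 0..31")
    if not 0 <= cwp < NUM_WINDOWS:
        raise ValueError("cwp out of range")
    if reg < 8:                      # globals: independent of cwp
        return reg
    if reg < 16:                     # Out's of the current window
        offset = 16 * cwp + (reg - 8)
    elif reg < 24:                   # Locals: private to the window
        offset = 16 * cwp + 8 + (reg - 16)
    else:                            # In's: shared with the next window's Out's
        offset = 16 * ((cwp + 1) % NUM_WINDOWS) + (reg - 24)
    return GROUP_SIZE + offset
```

With this layout, the In's for a cwp value of two and the Out's for a cwp value of three resolve to the same physical registers, as do the Out's for a cwp value of two and the In's for a cwp value of one. Each In/Out register is therefore reachable under two addresses, which is the dual addressing requirement noted above.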
Another problem certain register architectures have been configured to address is a class of applications that are highly parallel by the nature of the problem to be solved. These systems attempt to improve processing efficiency by grouping variables with similar processing requirements into a single quantity that is termed a "vector". The register architecture would provide vector registers, with each vector register capable of storing two or more variables (also referred to as elements or tuples of the vector). A vector register file is comprised of two or more such vector registers. A particular example of a vector processor is the Cray-1.
The Cray-1 has eight vector registers of 64 elements per vector register. Instead of requiring individual instructions to perform a given computation on individual pieces of data, hardware can construe a single vector instruction to perform the defined operation individually on all 64 corresponding pairs of data in two vector registers. The 64 results can then be written into a third vector register. This single-instruction-multiple-data (SIMD) approach is very effective for operating on data arrays where each element of the array can be treated identically. A single controlling mechanism can be defined to coordinate the processing of a large quantity of data.
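The behavior of a single vector instruction can be modeled in software as one call that applies the same operation to all 64 corresponding element pairs. The sketch below is only a sequential model of that semantics, with a hypothetical function name `vector_add`; real vector hardware processes the elements under one control mechanism rather than through a software loop.

```python
VECTOR_LENGTH = 64  # elements per vector register, as in the Cray-1


def vector_add(va, vb):
    """Model one vector instruction: element-wise add of two vector registers.

    Returns the 64 results that would be written to a third vector register.
    """
    if len(va) != VECTOR_LENGTH or len(vb) != VECTOR_LENGTH:
        raise ValueError("each vector register holds exactly 64 elements")
    return [a + b for a, b in zip(va, vb)]


# One "instruction" replaces 64 individual scalar add instructions.
v1 = list(range(VECTOR_LENGTH))
v2 = [10] * VECTOR_LENGTH
v3 = vector_add(v1, v2)
```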
Vectors offer a substantial reduction in complexity and can achieve high performance for applications that are vectorizable. However, systems that offer only vector processing can suffer large performance penalties in code that requires even a small amount of non-vector (i.e. scalar) processing. Thus, vector systems are often relegated to scientific applications where large arrays of data are processed in batch mode. They are not found in mainstream applications because they can efficiently execute only parallel code.
Yet another problem that is commonly faced, but for which most register architectures have offered only minimal support, is the problem of multi-processing, wherein multiple processes are executed at the same time. As a result, data must be provided to multiple independent contexts at the same time. However, except for basic context switching, the majority of the burden of supporting multi-processing has traditionally been borne by the operating system.
Thus, it would be desirable if a processor could be implemented with an ISA having an associated register architecture that has a very high bandwidth to meet the data requirements of superscalar processors, supporting maximum instruction issue rate at the highest possible clock frequency. Additionally, it would be desirable if the register architecture would facilitate parameter passing to a function call, operate as either a vector or a scalar register file, and provide data to multiple independent contexts, all with very low latency and minimal loss of efficiency when switching back and forth among scalar, vector and multi-processing. Furthermore, it would be desirable if the register architecture were highly scalable, allowing a wide range of upward compatible embodiments to be manufactured. As will be disclosed in more detail below, these and other desirable results are advantageously achieved by the present invention of a processor implemented with an ISA having an associated scalable, uni/multi-dimensional, and virtually/physically addressed register architecture.