The present invention relates to methods and apparatus for utilizing a plurality of registers in an indexed fashion such that data manipulation may be achieved using the registers as a local memory and such that data storage or loading from random access memory may be avoided.
Although the processing power of superscalar RISC processors is significant due, in part, to their use of multiple functional units (which allow several instructions to be executed simultaneously), there are problems with this processing approach, for example, pipeline interlock. A pipeline interlock delays the fetching of successor instructions because of interruptions in the execution of preceding instructions.
There are two basic types of interlock delays in conventional RISC processors. The first kind of interlock delay is a data dependence delay that determines instruction latency. In this context, an instruction is not executed until all of its source data have been evaluated by prior instructions. The second kind of interlock delay is a reservation delay, which arises when two instructions that are being executed require shared resources (e.g., data buses, internal registers, functional units, etc.) that are not always immediately available.
One of the conventional approaches to minimizing the impact of pipeline interlock delays is to utilize a fast random access memory (RAM), such as a hierarchical cache memory. Indeed, a level 1 (L1) cache memory may require only about 6 to 10 cycles to effect a storage or loading of data (when coupled to a processor running at a clock frequency on the order of a GHz). Reducing memory access latency generally has a positive effect on the overall processing speed, even when pipeline interlock delays exist.
There is, however, a limit on the efficacy of using hierarchical cache memories to offset the deleterious effects of pipeline interlock delays. Indeed, even hierarchical cache memories may exhibit latencies of about 6 to 10 clock cycles, where even lower latencies are desirable.
In order to avoid the latencies of RAM access, hierarchical cache access, or other data storage techniques, there has been a trend to utilize a large number of hardware registers as a stack for manipulating data. As hardware registers typically have latencies on the order of 1 clock cycle, they represent an attractive alternative to the use of RAM, cache, or other local memories that have higher latencies.
Although a substantial number of hardware registers may be employed as a surrogate memory for the manipulation of data, conventional instruction set architectures have not been optimized for intra-register data manipulation. For example, in order to move data from one hardware register to another hardware register, some conventional instruction sets require that memory access take place, such as a memory store followed by a memory load. The following operational code illustrates the dependency on RAM that conventional instruction sets have when effecting a transfer of data from one register, R1, to another register, R2.
    STORE R1, address1    ; store R1's data in RAM at address1
    LOAD  address1, R2    ; load data from RAM address1 into R2
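The cost of the store/load sequence above can be sketched with a toy model. The following is a minimal illustration only: the `Machine` class, the `move` operation, and the cycle constants are hypothetical and are chosen from the approximate figures quoted in this background (about 1 cycle for a register access and about 6 to 10 cycles for an L1 cache access); they are not measurements of any particular processor.

```python
# Illustrative cycle costs (assumptions drawn from the ranges quoted in the text).
REGISTER_CYCLES = 1   # register access: on the order of 1 clock cycle
MEMORY_CYCLES = 6     # low end of the 6-10 cycle range quoted for an L1 cache


class Machine:
    """Toy register machine that counts cycles; purely hypothetical."""

    def __init__(self, num_registers=8):
        self.regs = [0] * num_registers
        self.mem = {}
        self.cycles = 0

    def store(self, src, address):
        """STORE Rsrc, address: copy a register's data to memory."""
        self.mem[address] = self.regs[src]
        self.cycles += MEMORY_CYCLES

    def load(self, address, dst):
        """LOAD address, Rdst: copy data from memory into a register."""
        self.regs[dst] = self.mem[address]
        self.cycles += MEMORY_CYCLES

    def move(self, src, dst):
        """Hypothetical direct register-to-register move (no memory access)."""
        self.regs[dst] = self.regs[src]
        self.cycles += REGISTER_CYCLES


def transfer_via_memory(value):
    """Move a value from R1 to R2 using STORE followed by LOAD."""
    m = Machine()
    m.regs[1] = value
    m.store(1, 0x100)   # address1 is modeled as 0x100 (arbitrary)
    m.load(0x100, 2)
    return m.regs[2], m.cycles


def transfer_direct(value):
    """Move a value from R1 to R2 without touching memory."""
    m = Machine()
    m.regs[1] = value
    m.move(1, 2)
    return m.regs[2], m.cycles
```

Under these assumed costs, the memory round trip takes 12 cycles against 1 cycle for the direct move. The load also cannot begin until the store has completed, so the sequence is itself subject to a data dependence delay.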
The substantial latencies associated with RAM, however, may offset any benefits from utilizing the hardware registers as a data stack. This problem is exacerbated when the software program being executed requires a significant number of table lookups and/or branch instructions.
It is noted that some existing instruction sets may permit access to a few registers as operands, which involves indexing to such registers. Unfortunately, any such access would have to be defined at the time that the software code was written. These existing instruction sets cannot perform real-time indexing, in which index values are computed during program execution. Thus, reliance on memory access is still problematic in these systems. Other existing instruction sets might permit non-indexed register-to-register movement of data, but again the register definitions must be established at the time that the program is written, and no run-time definitions can be performed.
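The distinction between register numbers fixed when the program is written and register numbers computed during execution can be sketched as follows. This is an illustration under stated assumptions, not an instruction of any existing instruction set: `make_static_move`, `indexed_move`, and `lookup_demo` are hypothetical names, and the register file is modeled as a plain list.

```python
def make_static_move(src, dst):
    """Build a move whose register numbers are fixed when the program is written."""
    def instr(regs):
        regs[dst] = regs[src]
    return instr


def indexed_move(regs, index_reg, dst):
    """Hypothetical run-time indexed move.

    The source register number is read from regs[index_reg] during
    execution, so it can be the result of an earlier computation.
    """
    src = regs[index_reg] % len(regs)   # wrap to stay inside the register file
    regs[dst] = regs[src]


def lookup_demo(key):
    """Table lookup held entirely in registers (R4..R7 act as a 4-entry table)."""
    regs = [0] * 8
    regs[4:8] = [10, 20, 30, 40]
    regs[0] = 4 + (key % 4)             # index value computed at run time
    indexed_move(regs, index_reg=0, dst=1)
    return regs[1]
```

With the static form, a different source register requires a different instruction, so a data-dependent choice of register must be made through branches or a memory-resident table. With the indexed form, the same instruction serves every key.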
Therefore, there are needs in the art for methods and apparatus that are capable of improving intra-register data processing, such as moving data from one register to another, copying data from one register to another, etc., so that memory accesses may be significantly reduced and the associated latency may be avoided.