This invention relates in general to digital processing architectures and more specifically to the use of pipeline registers to facilitate improved processing performance.
A basic design for the architecture of digital signal processor (DSP) 10 is shown in the prior art diagram of FIG. 1A. DSP calculations require many iterations of fast multiply-accumulate and other repetitive operations. Typically, “functional units” such as multipliers, adders, accumulators, shifters, etc. are used to perform the operations. Such functional units are shown as 12, 14 and 16. The functional units obtain instructions and data, such as values, opcodes, operands, etc. (collectively referred to as “data”) from main memory 20, which is typically a random access memory (RAM). The DSP system can be included within a chip that resides in a device such as a consumer electronic device, computer, etc. Note that many variations on the design of FIG. 1A are possible. For example, a single functional unit, such as a general-purpose central processing unit (CPU), can be used. Typically, more than one memory storage unit is used, such as separate storage for instructions and data.
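As an illustration of the repetitive multiply-accumulate work described above, the following is a minimal software sketch only, not a description of functional units 12, 14 and 16. The function name `multiply_accumulate` and the sample values are hypothetical and chosen for illustration.

```python
def multiply_accumulate(samples, coefficients):
    """Return the sum of products, built one multiply-accumulate step at a time."""
    if len(samples) != len(coefficients):
        raise ValueError("samples and coefficients must have equal length")
    accumulator = 0
    for sample, coefficient in zip(samples, coefficients):
        # One multiply-accumulate step: multiply, then add into the running total.
        accumulator += sample * coefficient
    return accumulator


# Hypothetical input values, used only to exercise the loop.
result = multiply_accumulate([1, 2, 3], [4, 5, 6])
print(result)
```

Each loop iteration reads two operands and updates one accumulator, which is why the speed of data transfer to and from the functional units matters so much for this kind of workload.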
In the basic design, the functional units are constantly transferring data to and from memory, other functional units, and other devices, sources and destinations (collectively referred to as “components”). The speed at which data can be transferred among the various components in the architecture is a primary factor in determining the speed and efficiency of the overall design.
Since accesses to main memory (or external cache or other storage) are relatively slow and require using bus interface logic, one approach to improving performance is the use of bus register file interface 40 and bus register file 42. Bus register file 42 allows data to be stored close to, and in association with, localized bus 32 so that accessing the main memory is not necessary for frequently needed values. However, this approach still places limitations on access times and system performance, as explained below in connection with FIG. 1B.
FIG. 1B illustrates a basic pipelined instruction cycle having fetch, decode and execute stages. In FIG. 1B, pipelines 50, 60 and 70 allow concurrent execution of each stage in a manner that is known in the art. Thus, each of the fetch, decode and execute stages for different instructions or operations can be executed in a same clock cycle. This allows, e.g., the decode stage of instruction 52 to be executed at the same time as the fetch stage of instruction 62.
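The overlap of stages can be sketched with a small scheduling model. This is a simplified illustration that assumes each stage takes exactly one clock cycle and that each instruction starts one cycle after the previous one; the instruction indices 0, 1, 2 and the name `stage_schedule` are hypothetical and do not correspond to the reference numerals of FIG. 1B.

```python
STAGES = ("fetch", "decode", "execute")


def stage_schedule(num_instructions):
    """Map each clock cycle to the (instruction, stage) pairs active in that cycle."""
    schedule = {}
    for instruction in range(num_instructions):
        for offset, stage in enumerate(STAGES):
            # Assumption: instruction i enters fetch at cycle i, one stage per cycle.
            cycle = instruction + offset
            schedule.setdefault(cycle, []).append((instruction, stage))
    return schedule


schedule = stage_schedule(3)
for cycle in sorted(schedule):
    print(cycle, schedule[cycle])
```

Under these assumptions, cycle 1 holds the decode stage of the first instruction together with the fetch stage of the second, which is the kind of concurrency described above.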
One drawback of this approach is that a result of an instruction is not available until the end of the execute stage, when the data is computed and stored back into a bus register. Since data becomes available to an instruction only after its fetch stage, an instruction executing in a different pipeline may have to wait for one or more cycles before the data result of another instruction is available. For example, FIG. 1B shows instruction 52 completing its execute stage at a time designated by line 54. However, at this time instruction 62 is past its fetch stage, and so instruction 62 is delayed within the pipeline and its stages are repeated as instruction 64 so that a fetch stage is executed to obtain the data. Alternatively, the fetch stage of instruction 62 can be flushed, rescheduled, suspended, or affected in other ways until after the execute stage of instruction 52. In any case, the inability of instruction 62 to have needed data at the time of its execute stage causes delays and inefficiencies in processing. The use of additional pipelines, such as pipeline 70, can compound and further complicate data accesses.
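The delay described above can be estimated with a simplified model. The sketch assumes one-cycle stages, a result that becomes fetchable in the cycle after the producing instruction's execute stage, and a dependent instruction that must obtain the value during its own fetch stage. The function names and cycle numbers are hypothetical and are not taken from FIG. 1B.

```python
EXECUTE_OFFSET = 2  # fetch = +0, decode = +1, execute = +2 cycles after start


def result_ready_cycle(producer_start):
    """First cycle in which a result written at the end of execute can be fetched."""
    return producer_start + EXECUTE_OFFSET + 1


def dependent_fetch_delay(producer_start, consumer_start):
    """Cycles by which the consumer's fetch must be pushed back to see the result."""
    return max(0, result_ready_cycle(producer_start) - consumer_start)


# Producer starts at cycle 0; the dependent instruction starts one cycle later.
delay = dependent_fetch_delay(0, 1)
print(delay)
```

In this model a dependent instruction that starts one cycle behind its producer loses two cycles, while one that starts three or more cycles behind loses none. Actual penalties depend on the specific pipeline design.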
Thus, it is desirable to provide a design that improves data accesses in a digital processing architecture.