Deep-pipelined architectures are widely employed in application-specific integrated circuits (ASICs) and graphics processors. Pipelining is an implementation technique that divides tasks such as rendering a graphic object or implementing wavelet transforms into smaller subtasks that are allocated to a plurality of sequential units or stages. Its goal is to achieve a relatively high throughput by making full use of the processing capabilities of a device. In microprocessors, this technique is sometimes used to accelerate individual units, such as functional units and caches. In ASICs and graphics processors, deep pipelining is frequently used to handle high-level graphic tasks.
As shown in FIG. 1, a deep-pipelined architecture typically includes configuration registers 22 that are used to store configuration parameters employed in processing tasks. The configuration parameters are supplied to the configuration registers under the control of a host central processing unit (CPU) 24, to enable processing of the tasks by N pipeline stages 26a, 26b, 26c, . . . 26n. The configuration registers are connected to the pipeline stages, and for every processing cycle, the output of each pipeline stage is passed to the next stage in the sequence. For each task, host CPU 24 sets up the required parameters in configuration registers 22 and then issues a “run” command to the pipeline. As used herein, the term “run” refers to a coordinated action applied to execute a task using the stages in a pipeline. Since in the configuration shown in FIG. 1, host CPU 24 needs to wait for the pipeline to complete the current run before initiating the next run, the time for each run includes the time for a setup and the time for execution of the tasks comprising the run in the pipeline. While the time for completing a setup (which is referred to herein as the “setup time” or “setup delay”) depends on the number of configuration registers that must be set up, the time for a pipeline execution of a task (i.e., the pipeline run time) depends on the complexity of the task being executed by the pipeline, e.g., the size of each triangle being rendered in a graphics pipeline.
FIG. 2A illustrates the times involved in executing tasks when one set of configuration registers is used in a pipeline. In this simple example, it is assumed that there are only three tasks (x, y, and z). For each task, the host CPU sets up the configuration registers and waits for the pipeline to finish the task. Therefore, a new task can be issued after each time interval that corresponds to the sum TS+TR, where TS is the setup delay and TR is the pipeline run time. The pipeline run time for each stage includes a pipeline active time (TA), when a task is being executed by the stage, and a pipeline delay (TP). A pipeline stage is active during TA, whereas it is idle during TP, waiting for the output of another stage to be completed. Thus, in deep-pipelined architectures, the setup delay and the pipeline delay together affect the overall pipeline utilization efficiency. FIGS. 2B and 2C illustrate known techniques that are used to reduce these undesired delays.
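The timing with a single set of configuration registers can be sketched as a small model. The function name and the numeric values below are illustrative assumptions rather than values taken from FIG. 2A, and the sketch assumes that every run has the same setup delay TS and pipeline run time TR, which the description does not require.

```python
def total_time_single_set(num_runs, t_setup, t_run):
    """Total time when each run pays a full setup followed by a full run.

    Simplifying assumption: TS (t_setup) and TR (t_run) are identical
    for every run, so a new task is issued every TS + TR time units.
    """
    return num_runs * (t_setup + t_run)


# Three tasks with hypothetical TS = 2 and TR = 10 time units.
print(total_time_single_set(3, 2, 10))  # 36
```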
FIG. 2B shows the timing relationships where two sets of configuration registers are used to reduce the setup delay. In this case, the host CPU prepares a set of registers for the next run (e.g., prepares configuration register set 2 to run task Y), while the pipeline processes the current run (e.g., task X) with another set of registers (e.g., register set 1), thus eliminating the setup delay, except for the first run. The total reduced time, which is indicated by the symbol ① in FIG. 2B, is equal to (NR−1)×TS, where NR is the number of runs and TS is the setup delay. An exemplary pipeline system 40 like those used in the prior art to reduce the setup delay is shown in FIG. 3A. Two sets of configuration registers (i.e., a master configuration register set 42 and a slave configuration register set 44) are sequentially connected. Host CPU 24 prepares the master configuration register set for the next run, while the pipeline works with the slave configuration register set. The configuration parameters in the master configuration register set are transferred to the slave set before each run, where they are available to the stages.
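The saving of (NR−1)×TS can be checked with a minimal timing sketch of the master/slave arrangement. The function names are hypothetical, and the model assumes a uniform TS and TR for every run and an instantaneous master-to-slave transfer; none of these assumptions comes from FIG. 2B or FIG. 3A.

```python
def total_time_master_slave(num_runs, t_setup, t_run):
    """Total time when the next setup overlaps the current run.

    Assumptions (not from the source): TS and TR are the same for every
    run, and the master -> slave transfer takes no time.
    """
    if num_runs == 0:
        return 0
    start = t_setup  # the first run still pays the full setup delay
    for _ in range(num_runs - 1):
        # The host begins writing the master set as the current run starts.
        master_ready = start + t_setup
        run_done = start + t_run
        # The next run needs both a finished run and a fully written master.
        start = max(master_ready, run_done)
    return start + t_run


def time_saved(num_runs, t_setup, t_run):
    """Difference from the single-register-set total, NR x (TS + TR)."""
    single = num_runs * (t_setup + t_run)
    return single - total_time_master_slave(num_runs, t_setup, t_run)


print(time_saved(3, 2, 10))  # 4, which equals (NR - 1) x TS
```

In this model the saving equals (NR−1)×TS only while TS does not exceed TR. If the setup takes longer than the run, the pipeline waits for the master set, and the saving becomes (NR−1)×TR instead.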
Two configuration register sets are sometimes used differently in the prior art to reduce the pipeline delay, instead of the setup delay, as shown in the timing diagram of FIG. 2C. In this case, as illustrated in a pipeline system 50 in FIG. 3B, host CPU 24 initially prepares a first configuration register set 54 and a second configuration register set 56 with parameters that are supplied through a 1:2 demultiplexer (DEMUX) 52 (or other switching configuration that carries out the equivalent function), and then initiates a run that uses the parameters in the first configuration register set. Once the first pipeline stage 26a (Stage 1) finishes the first run, the host CPU initiates the second run using the parameters in the second configuration register set 56. As will be apparent in FIG. 2C, other pipeline stages are still processing the first run with the parameters in the first configuration register set (e.g., stage N is still processing task X using the parameters in configuration register set 1) when the first pipeline stage starts the second run (e.g., processing task Y) with the parameters of the second configuration register set, so that the two runs overlap in time. In this example, a pipeline delay can be eliminated every two runs, as indicated by the time interval identified using the symbol ② in FIG. 2C.
The exemplary configuration shown in FIG. 3B employs two sets of configuration registers that are connected in parallel. The host CPU sets up both configuration register sets, as indicated by the symbol ① in FIG. 3B, and then sends an index value (1 or 2) to the first pipeline stage, as indicated by the symbol ② in FIG. 3B. This index value (shown as A) is passed to the next stage of the pipeline along with the output of the previous stage (shown as B), and the index value is employed to select the proper configuration register set for each of stages 26a, 26b, 26c, . . . 26n via 2:1 multiplexer (MUX) array stages 58a, 58b, 58c, . . . 58n (or using other techniques that achieve the same function).
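The index-based selection can be illustrated with a cycle-by-cycle sketch in which the index travels down the pipeline with the data. The function, the parameter names such as "a1", and the assumption that each stage spends exactly one cycle on each task are hypothetical simplifications and are not taken from FIG. 3B.

```python
def run_pipeline(register_sets, tasks, num_stages):
    """Model stages that pick a configuration register set by index.

    register_sets: dict mapping an index (1 or 2) to a list with one
                   parameter per stage.
    tasks:         list of (index, data) pairs, fed to stage 1 one per cycle.
    Returns (cycle, stage, index, parameter) for every active stage.
    Assumption: each stage takes exactly one cycle per task.
    """
    slots = [None] * num_stages  # the (index, data) pair held by each stage
    pending = list(tasks)
    log = []
    cycle = 0
    while pending or any(slot is not None for slot in slots):
        # Each stage's output moves to the next stage together with its index.
        slots = [pending.pop(0) if pending else None] + slots[:-1]
        for stage, slot in enumerate(slots):
            if slot is not None:
                index, _data = slot
                # Per-stage 2:1 MUX: the index selects the register set.
                param = register_sets[index][stage]
                log.append((cycle, stage + 1, index, param))
        cycle += 1
    return log


sets = {1: ["a1", "a2", "a3"], 2: ["b1", "b2", "b3"]}
log = run_pipeline(sets, [(1, "X"), (2, "Y")], num_stages=3)
print([entry for entry in log if entry[0] == 1])
# [(1, 1, 2, 'b1'), (1, 2, 1, 'a2')]
```

In cycle 1 of this sketch, stage 1 already works on task Y with register set 2 while stage 2 is still working on task X with register set 1, which is the overlap of runs that the index makes possible.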
Even though pipeline system 50 can reduce the pipeline delay to some extent and can be extended to multiple register sets (e.g., to M configuration register sets), a pipeline delay still exists every M runs, in addition to the setup delay. This technique is useful where multiple sets of configuration parameters are required for a higher-level task and most parameters can be reused in subsequent higher-level tasks, e.g., multi-pass rendering in a graphics pipeline where several runs are required to map multiple textures to a triangle. However, the prior art enables either setup delays or pipeline delays to be substantially reduced, but not both. Clearly, it would be desirable to substantially reduce or eliminate the overhead of both setup and pipeline delays in consecutive pipeline runs, thus increasing the overall pipeline utilization.