The present invention relates to processor design, and more particularly to a dynamic pipeline for executing instructions where the number of stages of the pipeline is dynamically modified depending upon the instruction or operation being executed.
Pipelining is used in microprocessors to improve performance, by overlapping multiple instructions in a pipeline structure to decrease overall execution time. Each instruction is broken down into one or more common elemental operations that are performed sequentially to complete that instruction. The pipeline structure is formed of a plurality of pipe segments or stages, where each stage performs one of the elemental operations. Thus the pipeline is similar to an assembly line where each of the elemental operations is performed in a corresponding stage of the pipeline. The instruction begins at one end of the pipeline and is completed at the other end. Each stage of the pipeline is separated by registers or latches, and thus a new instruction enters the first stage of the pipeline while one or more previous instructions are being executed within subsequent stages of the pipeline. In this manner, although the time required to execute each instruction is not changed substantially, the overall execution time for a plurality of instructions is decreased.
Previously, the design of pipelines generally conformed to a few simple rules. First, the number of stages in a pipeline was determined by the most complex instruction to be performed by the processor, i.e., the number of stages was fixed to that number of stages needed to perform the most complex instruction of the processor. Thus, each instruction propagated through a fixed number of stages of the pipeline, regardless of how simple or complex that instruction was. Also, each stage was executed in a single clock cycle, and thus the speed of the clock was based on the slowest stage of the pipeline. With each edge of the clock signal, the data associated with an instruction was advanced to the next stage to perform the next elemental operation.
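The timing consequences of these rules can be illustrated with a minimal sketch. The function names and the per-stage delays below are hypothetical and are chosen only for illustration; the sketch assumes an ideal pipeline with no stalls.

```python
def fixed_pipeline_time_ns(num_instructions, stage_delays_ns):
    """Time to run a batch through a fixed pipeline, in nanoseconds."""
    if num_instructions == 0:
        return 0
    # The clock period is set by the slowest stage.
    clock_period_ns = max(stage_delays_ns)
    num_stages = len(stage_delays_ns)
    # The first instruction takes num_stages cycles; each later one
    # completes one cycle after its predecessor.
    cycles = num_stages + (num_instructions - 1)
    return cycles * clock_period_ns


def unpipelined_time_ns(num_instructions, stage_delays_ns):
    """Time when each instruction finishes before the next one starts."""
    return num_instructions * sum(stage_delays_ns)


delays = [5, 8, 6, 7]  # hypothetical per-stage delays in ns
print(fixed_pipeline_time_ns(100, delays))  # (4 + 99) * 8 = 824
print(unpipelined_time_ns(100, delays))     # 100 * 26 = 2600
```

In this example a single instruction takes 32 ns through the pipeline versus 26 ns without it, yet the batch of 100 instructions completes roughly three times sooner, because every instruction passes through all four stages at the pace of the slowest stage.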
Pipelining has been a useful technique for improving the performance of processors for many applications. A processor using RISC (reduced instruction-set computer) principles is a prime candidate for a pipelined architecture. In a RISC processor, the instruction set is generally limited to a small number of simple functions, and thus the pipeline can be optimized to execute each of the simple instructions very quickly. Pipelining is also advantageous for use in graphics processors for the same reason. A graphics processor uses a relatively small instruction set to perform a variety of graphic data transfer operations and to execute a plurality of graphics equations. Although the present invention is not limited to any particular processor application, the preferred embodiment described below is incorporated into a graphics processor, and thus background on graphics processors is deemed appropriate.
The advent of substantial hardware improvements combined with standardized graphics languages has allowed the use of complex graphics functions in even the most common applications. For example, word processors, spreadsheets, and desktop publishing packages are now beginning to take full advantage of the improvements in graphics capabilities to improve the user interface. Although sophisticated graphics packages have been available for computer aided drafting, design and simulation for years, three dimensional graphic displays are now common in games, animation and multimedia communication designed for personal computers.
The architecture of the personal computer system has advanced to handle the sophisticated graphic capabilities required by modern software applications. In the simplest of designs, a single CPU handled all data functions, including graphics functions. In more complicated architectures, a separate graphics processor is provided to perform all graphic functions in order to relieve the primary CPU of this duty and to free up the CPU to perform other operations. Generally, the graphics processor is connected between a computer system bus and the video or frame buffer. The frame buffer is the memory which stores the video data that is actually displayed on the video screen. A video controller is connected to the frame buffer to convert the digital rasterized data from the frame buffer to the analog signals needed by the display device. In other more sophisticated architectures, the frame buffer is directly connected to the system bus, either separately or as part of the main memory, and thus the main CPU as well as the graphics processor can access the frame buffer memory across the system bus.
A graphics processor generally performs data transfer operations and functions for drawing points, lines, polylines, text, string text, triangles, and polygons to the frame buffer. Furthermore, the graphics processor performs many graphics functions on the data within the frame buffer, such as patterning, depth cueing, color compare, alpha blending, accumulation, texture assist, anti-aliasing, supersampling, color masking, stenciling, panning and zooming, error correction, as well as depth and color interpolation, among other functions.
It is evident that the demand for greater graphic capabilities has increased dramatically, and that computer architectures have been improved to partially meet this demand. Also, graphics processors must be capable of performing more sophisticated functions in less time in order to process the increasingly greater amounts of graphical data required by modern software applications. Although graphics processors typically use a pipelined architecture to improve speed and performance, the ever increasing demand for more sophisticated operations has required a greater amount of time for a given stage to execute, thereby reducing performance. As processing demands increase, there is a greater need for a processor with the capability to perform more sophisticated functions in a shorter amount of time. Therefore, there is a need for improved pipelining architectures to increase processor performance, both for graphics processors and for general purpose microprocessors.
In a processor incorporating a dynamic pipeline according to the present invention, the number of stages of the pipeline is varied depending upon the complexity of the instruction being performed. The dynamic pipeline includes a set of latches to separate the stages of the pipeline. The dynamic pipeline also includes a plurality of multiplexers which dynamically alter the data path to bypass corresponding latches based on the instruction. In this manner, the number of stages is reduced for simpler instructions, i.e., the pipeline is collapsed to perform the simpler instructions in fewer clock cycles. Collapsing the pipeline to perform the simpler instructions with fewer stages therefore increases the speed and performance of the processor. The maximum number of stages is used for more complex operations, such as alpha-blending in a graphics application processor, while fewer stages are used for simpler operations.
In the preferred embodiment, a circuit provides data to a first latch, which provides the latched data to a first operation element. The first operation element is preferably a multiplier for alpha blending. A data selector, which is preferably a multiplexer (mux), selects between the data from the circuit or the output of the first operation element and provides an output to a second latch. The second latch provides data to a second operation element. Control logic receives the instruction currently being executed and controls the data selector based on the instruction. In this manner, depending on the instruction currently being executed, the data selector can collapse the pipeline by bypassing the first latch and the multiplier.
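The bypass path described above can be modeled with a minimal behavioral sketch. The class name, the integer data values, the use of a constant `alpha` as the multiplier's second operand, and the use of a constant `offset` at the adder are all assumptions made for illustration, and `None` stands in for a latch that does not yet hold valid data.

```python
class CollapsiblePipe:
    """Behavioral model: latch -> multiplier -> mux -> latch -> adder."""

    def __init__(self, alpha, offset):
        self.alpha = alpha      # assumed second multiplier operand
        self.offset = offset    # assumed second adder operand
        self.first_latch = None
        self.second_latch = None

    def clock(self, data_in, bypass):
        """Advance one clock edge; bypass is set by the control logic."""
        # Multiplier output is computed from the currently latched value.
        if self.first_latch is None:
            multiplied = None
        else:
            multiplied = self.first_latch * self.alpha
        # Data selector: take the circuit data directly (collapsed) or
        # the multiplier output (full pipeline).
        selected = data_in if bypass else multiplied
        self.first_latch = data_in
        self.second_latch = selected

    def output(self):
        """Adder result for the value currently in the second latch."""
        if self.second_latch is None:
            return None
        return self.second_latch + self.offset


collapsed = CollapsiblePipe(alpha=3, offset=1)
collapsed.clock(10, bypass=True)
print(collapsed.output())  # 11, available after one clock

full = CollapsiblePipe(alpha=3, offset=1)
full.clock(10, bypass=False)
print(full.output())       # None, result not ready yet
full.clock(20, bypass=False)
print(full.output())       # 31, available after two clocks
```

With the data selector set to bypass, the value reaches the adder one clock earlier because it never waits in the first latch.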
The first and second latches are each preferably formed of two aligned latches. Thus, for example, the second latch may include one latch which receives data from the data selector, and another latch which receives data from a register or other data providing means. The second operation element, which is preferably an adder, either adds or subtracts the data output from the two aligned latches.
Another data selector, also preferably a multiplexer, is optionally included to simulate the addition of another stage by selecting between the register and the second operation element. The multiplexer selects only the register if an additional stage is not needed. However, if another stage is needed, the control logic controls the second data selector to alternately select between the register and the second operation element on consecutive clock cycles. Furthermore, the control logic controls the second operation element to select the desired operation to be performed by the adder on consecutive clock cycles. The last stage may alternatively be added by including separate latches and another operation element rather than switching the data selector.
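The alternating selection can be sketched as follows. This is a simplified model, not a cycle-accurate one: the function name is hypothetical, the input values are assumed to arrive in pairs on consecutive clock cycles, and the subtract-then-add order is an assumption patterned on the threepipe mode described later in this text.

```python
def simulate_extra_stage(values, register_value):
    """Reuse one adder on consecutive clocks to act as an extra stage.

    values: data arriving from the first data selector, one per clock.
    register_value: constant held in the register.
    """
    results = []
    select_register = True  # second data selector starts on the register
    fed_back = None
    for value in values:
        if select_register:
            # First pass: adder subtracts the register value (assumed op).
            fed_back = value - register_value
        else:
            # Second pass: adder adds the fed-back result (assumed op).
            results.append(value + fed_back)
        # Control logic alternates the selection on consecutive clocks.
        select_register = not select_register
    return results


print(simulate_extra_stage([10, 4, 7, 1], register_value=3))  # [11, 5]
```

Each final result consumes two clock cycles of the same adder, which is why the arrangement behaves like an additional stage without additional latches or a second adder.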
In the preferred embodiment, a first circuit includes a first set of muxes which are used to determine the source of the incoming data as well as the logic operation to be performed by an arithmetic logic unit (ALU). A color source mux determines whether the incoming data is provided from an internal polyengine color interpolator, from internal color registers, or from an external color source, such as the host CPU or a local interface. The external source is also provided to a first-in first-out (FIFO) input which is used to synchronize the incoming data for the pipeline. Two input muxes select the input data provided to the ALU, which performs logic functions on the incoming data.
A second circuit preferably comprises an alpha source mux which determines the source of an alpha value for alpha blending operations. The alpha value may be supplied from an internal interpolator, from predefined alpha registers or from an external source, such as the private or frame buffer memory. The output of the alpha source mux is provided to an alpha inverter, which determines whether the source value is amplified or attenuated. The output of the ALU is provided to a first latch and to an enable mux. The output of the alpha inverter is provided to a second latch, which is aligned with the first latch. The first and second circuits effectively form a first stage of the pipeline for providing data, but are not considered part of the dynamic portion of the pipeline.
The outputs of the first and second aligned latches are provided to the respective inputs of a multiplier having its output provided to one input of a multiplier select mux. This divides the first stage from a second stage of the pipeline. The output of the ALU is also provided to the other input of the enable mux, which provides its output to a second input of the multiplier select mux. The enable and multiplier select muxes form a data selector which is used to bypass the second stage of the pipeline for those operations not requiring multiplication. Control logic receives the instruction currently being executed and controls these muxes based on the instruction.
The output of the multiplier select mux is provided to a third latch, and the output of the third latch is provided to an adder. The other input of the adder receives the output of a fourth latch aligned with the third latch. The fourth latch receives an offset scalar value from a register. The third and fourth aligned latches separate the second stage from a third stage of the pipeline and provide latched data to the adder. These latches are always used in the preferred embodiment, even when the pipeline is fully collapsed.
An offset select mux, which receives the output of the adder at one input and the offset scalar value from the register at a second input, provides its output to the fourth latch. This simulates the addition of another stage, where the offset select mux is controlled by the control logic to alternately select between the register and the adder on consecutive clock cycles. The output of the adder is provided to a color and pixel mask logic, which provides its output to an output FIFO. The output FIFO provides buffered outputs to the host data bus and to the local data bus.
The number of stages of the dynamic pipeline according to the present invention is dynamically changed as follows. Each of the first through fourth latches receives a clock input and therefore latches data from its input to its output on every clock cycle. In the preferred embodiment, the dynamic pipeline has four different modes, including a fast onepipe, a fast twopipe, a read-modify-write twopipe, and a threepipe mode. The fast onepipe mode is used for simple operations. To implement a fast onepipe, the enable mux selects the output of the ALU and the multiplier select mux selects the output of the enable mux to bypass the first and second latches and the multiplier. The offset select mux selects the offset register so that the adder adds the output of the ALU to the offset value on each clock cycle.
To implement a fast twopipe, the multiplier select mux selects the multiplier output, while the offset select mux remains selected to the offset value. In this manner, the outputs of the ALU and the alpha inverter are latched on each clock cycle by the first and second latches, respectively, the latched values are multiplied together by the multiplier, and this multiplied result is added to or subtracted from an offset scalar value after the third and fourth latches are clocked. For a twopipe including read-modify-write capability, pixel values are read from an external source and placed into an input FIFO, where the external pixel values are combined with internal pixel values in the ALU. Otherwise, the pipeline operates similarly to the fast twopipe.
Finally, to implement a threepipe pipeline, the offset select mux is controlled to alternate between the output of the adder and the offset register. Thus, the outputs of the ALU and alpha inverter are latched and multiplied in a second stage, the offset value is latched through the fourth latch and subtracted from the multiplied value in a third stage, and the result is fed back to the fourth latch and added to a new multiplied value from the third latch in a fourth and final stage of the pipeline.
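The mux settings for the four modes can be summarized in a minimal sketch of the control logic. The enum, the function name, the string labels, and the dictionary layout are hypothetical. The sketch also assumes that the offset select mux picks the register on even cycles and the adder on odd cycles in threepipe mode, and it reports the enable mux setting as `None` in the modes for which the text does not state one.

```python
from enum import Enum


class Mode(Enum):
    FAST_ONEPIPE = "fast onepipe"
    FAST_TWOPIPE = "fast twopipe"
    RMW_TWOPIPE = "read-modify-write twopipe"
    THREEPIPE = "threepipe"


def mux_selects(mode, cycle):
    """Return the mux selections for a mode on a given clock cycle."""
    bypass_multiplier = mode is Mode.FAST_ONEPIPE
    if mode is Mode.THREEPIPE:
        # Alternate on consecutive clocks (cycle parity is an assumption).
        offset_source = "register" if cycle % 2 == 0 else "adder"
    else:
        offset_source = "register"
    return {
        # Not stated for the other modes, so left as None.
        "enable_mux": "alu" if bypass_multiplier else None,
        "multiplier_select_mux": (
            "enable_mux" if bypass_multiplier else "multiplier"
        ),
        "offset_select_mux": offset_source,
        # Only the read-modify-write twopipe reads external pixel values.
        "uses_input_fifo": mode is Mode.RMW_TWOPIPE,
    }


for cycle in range(2):
    print(cycle, mux_selects(Mode.THREEPIPE, cycle)["offset_select_mux"])
```

Only the fast onepipe mode routes data around the first and second latches and the multiplier; the two twopipe modes differ only in whether external pixel values are read through the input FIFO.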
It is appreciated that since the number of stages of the dynamic pipeline can be varied on demand, simpler instructions can be executed much faster to improve the overall speed and performance of the processor. This is particularly advantageous in graphic processor design, so that graphic operations can be performed at a higher rate.