In addition to those discussed above and below, this application is related to the following copending U.S. patent applications, each of which was filed on even date herewith and is incorporated herein, in its entirety, by reference:
Application Ser. No. 09/354,083, filed Jul. 15, 1999, entitled “GRAPHICS PROCESSING WITH TRANSCENDENTAL FUNCTION GENERATOR,” naming Vernon Brethour and Stacy Moore as inventors; and
Application Ser. No. 09/354,217, filed Jul. 15, 1999, entitled “GRAPHICS PROCESSING FOR EFFICIENT POLYGON HANDLING,” naming Dale Kirkland and William Lazenby as inventors.
The present invention relates to computers, and more particularly to computers using very large instruction words for various purposes, including for graphics processing.
In the implementation of graphics display systems for digital computers, it is sometimes desirable to have dedicated hardware support for geometry calculations in addition to the more common support for triangle setup and rasterization. Because graphics display systems often involve the display of objects based on three-dimensional data describing the objects, the geometry calculations involve, among other things, transforming locations of objects expressed in three-dimensional world coordinates into locations expressed in two-dimensional coordinates as the objects appear on the display. For some applications and configurations of graphics systems, the processing capability of the geometry accelerator becomes critically important. In the simplest case, geometry computations are accomplished one coordinate at a time, one vertex at a time, one triangle at a time, one triangle strip at a time.
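By way of illustration only, the following minimal Python sketch shows the kind of geometry calculation described above: a vertex in three-dimensional coordinates is multiplied by a 4x4 matrix, one output coordinate at a time, and then divided by its homogeneous w component to give two-dimensional coordinates. The function name `transform_vertex` and the matrix `PERSPECTIVE` are assumptions made for this example and are not taken from any particular graphics system.

```python
def transform_vertex(matrix, vertex):
    """Map a 3-D vertex to 2-D coordinates with a 4x4 matrix and a perspective divide."""
    x, y, z = vertex
    homogeneous = (x, y, z, 1.0)
    # Each output coordinate is computed separately, one coordinate at a time.
    clip = [sum(row[k] * homogeneous[k] for k in range(4)) for row in matrix]
    w = clip[3]
    if w == 0:
        raise ZeroDivisionError("vertex cannot be projected (w is zero)")
    return (clip[0] / w, clip[1] / w)


# Hypothetical projection matrix: copies z into w, so x and y are divided by depth.
PERSPECTIVE = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

if __name__ == "__main__":
    print(transform_vertex(PERSPECTIVE, (2.0, 4.0, 2.0)))
```

A full pipeline would also apply modeling, viewing, and viewport transformations; the sketch shows only the per-vertex arithmetic that a geometry accelerator must repeat for every vertex of every polygon.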
Data presented to a computer graphics subsystem are often expressed as strips of polygons (often triangles) in accordance with a graphics processing standard, such as the well known OpenGL graphics library. Rendering a scene involves transforming the coordinates of all of the polygons in all of the strips and determining the pixel values in the display that are associated with each portion of each of the polygons that appears in the display. The large amount of data involved in these calculations, in relation to the conflicting goals of achieving rendering both quickly and in detail, places heavy demands on computational resources.
Substantial opportunities exist for parallel computation by breaking up the triangle strips and presenting the resulting sub-strips to different computation engines in parallel. THE REALITY ENGINE, distributed by Silicon Graphics, Inc. of Mountain View, Calif., and the GLZ family of graphics accelerators, distributed by INTENSE 3D of Huntsville, Ala., are examples of systems that employ this technique extensively. In these systems, once the strips are broken up, the sub-strips are passed to standard processor elements, where the rest of the computation takes place basically one coordinate at a time, one vertex at a time. In the Reality Engine, these computations are done with an i860 processor from Intel. In the GLZ family of graphics accelerators, these computations are done with DSP chips from Analog Devices of Norwood, Mass. In systems like these, some limited parallelism takes place in the coordinate transformations because the computation engines employed are pipelined math units with separate engines for integer and floating point calculations.
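The breaking up of a triangle strip into sub-strips may be sketched as follows. This is a minimal illustration under stated assumptions, not a description of how the systems named above perform the split: the function names `split_strip` and `triangles` are hypothetical, adjacent sub-strips share two vertices so that no triangle is lost, and the sub-strip length is restricted to an even number of triangles so that every sub-strip begins with the same winding order as the original strip.

```python
def triangles(strip):
    """List the triangles defined by a strip: vertices (i, i+1, i+2) for each i."""
    return [tuple(strip[i:i + 3]) for i in range(len(strip) - 2)]


def split_strip(vertices, max_triangles):
    """Split one triangle strip into sub-strips of at most max_triangles triangles."""
    if max_triangles < 2 or max_triangles % 2:
        # An even length keeps each sub-strip starting on an even vertex index,
        # which preserves the winding order of the original strip.
        raise ValueError("max_triangles must be a positive even number")
    n_triangles = len(vertices) - 2
    sub_strips = []
    for first in range(0, n_triangles, max_triangles):
        last = min(first + max_triangles, n_triangles)
        # A run of k triangles needs k + 2 vertices, so neighbours overlap by two.
        sub_strips.append(vertices[first:last + 2])
    return sub_strips


if __name__ == "__main__":
    strip = list(range(10))  # vertex indices standing in for vertex data
    for sub in split_strip(strip, 4):
        print(sub, triangles(sub))
```

Each sub-strip can then be handed to a different computation engine, at the price of transforming the two shared vertices at each boundary twice.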
In U.S. Pat. No. 5,745,125, assigned to Sun Microsystems, separate specialized computation engines are arranged in series to form a deeper pipeline than would normally occur.
It is a known goal in computer design to employ very large instruction words (VLIW) for achieving increased parallelism in computation. To make it practical to program such computers, high level programming languages are devised that employ instructions utilizing a register-to-register type of instruction set. The effect of a successful VLIW machine is to launch and complete a great many instructions on each clock cycle, so the register-to-register instruction set requires a register file with many read ports and many write ports. For example, U.S. Pat. No. 5,644,780, assigned to International Business Machines, describes a register file for VLIW with 8 write ports and 12 read ports. The result is a VLIW computation engine that is capable of high levels of parallelism but that requires many registers and can therefore be built only at great cost.
The present invention achieves high levels of parallelism in a graphics processor by providing in a first embodiment an apparatus for processing computer graphics requests utilizing a wide word instruction. The apparatus of this embodiment has
1. a graphics request input;
2. a processor, coupled to the graphics request input, having an output, and responsive to instructions, wherein each instruction is a wide word.
In a further related embodiment, each instruction is a very wide word. In a further embodiment, each instruction is a super wide word. In a still further embodiment, each instruction is an ultra wide word. In a related embodiment, which may, but need not, employ an instruction that is a wide word, a very wide word, a super wide word or an ultra wide word, the processor has functional units producing n results per clock cycle and registers for storing not more than n/2 of such results. In a further related embodiment, the functional units are connected by a cross bar.
As used in this description and the accompanying claims, unless the context otherwise requires, the following definitions are employed. A “wide word” is an instruction, for a processor, that is issued in a single clock cycle of the processor, and providing greater than 64 bits of control to the processor. A “very wide word” is an instruction, for a processor, that is issued in a single clock cycle of the processor, and providing greater than 99 bits of control to the processor. A “super wide word” is an instruction, for a processor, that is issued in a single clock cycle of the processor, and providing greater than 128 bits of control to the processor. An “ultra wide word” is an instruction, for a processor, that is issued in a single clock cycle of the processor, and providing greater than 255 bits of control to the processor. A “register” is a storage element associated with a processor permitting reading of data on the processor clock cycle that immediately follows the clock cycle in which storage has been accomplished.
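Because the four width definitions are nested, an instruction that qualifies as an ultra wide word also qualifies as a super wide word, a very wide word, and a wide word. The following short Python sketch, offered only as an illustration, applies the thresholds stated above; the function name `classify_instruction_width` and the table `THRESHOLDS` are assumptions made for this example.

```python
# Each term applies when the instruction provides MORE than this many control bits.
THRESHOLDS = [
    ("wide word", 64),
    ("very wide word", 99),
    ("super wide word", 128),
    ("ultra wide word", 255),
]


def classify_instruction_width(control_bits):
    """Return every defined term that an instruction of this width satisfies."""
    return [name for name, limit in THRESHOLDS if control_bits > limit]


if __name__ == "__main__":
    for bits in (64, 65, 128, 256):
        print(bits, classify_instruction_width(bits))
```

Note that the comparisons are strict: an instruction providing exactly 64 bits of control is not a wide word, and one providing exactly 128 bits is a very wide word but not a super wide word.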
In another embodiment of the invention, there is provided a computer having data stores with multiple addressing modes. In this embodiment, the computer has
1. a data input;
2. a processor, coupled to the data input, and having an output and responsive to instructions; and
3. a plurality of data stores, coupled to the processor, each data store having a plurality of addressing modes, wherein a single instruction individually selects an addressing mode for each of the data stores.
The plurality of addressing modes may include indirect and absolute addressing modes. The indirect mode may further include a double level of indirect addressing. In a further embodiment, each instruction is a wide word. In a still further embodiment, there is provided a computer for processing computer graphics requests, wherein the data input is a graphics request input.
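The per-store selection of addressing modes may be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `Mode`, `DataStore`, and `execute_reads` are hypothetical, and the indirect modes are modeled in the simplest way, with the operand naming a cell whose contents are used as the address. The source does not specify how the indirect address is actually held.

```python
from enum import Enum


class Mode(Enum):
    ABSOLUTE = "absolute"
    INDIRECT = "indirect"
    DOUBLE_INDIRECT = "double indirect"


class DataStore:
    """A small memory whose reads can use any of the addressing modes."""

    def __init__(self, cells):
        self.cells = list(cells)

    def read(self, mode, operand):
        if mode is Mode.ABSOLUTE:
            return self.cells[operand]
        if mode is Mode.INDIRECT:
            # Assumption: the operand names a cell that holds the real address.
            return self.cells[self.cells[operand]]
        if mode is Mode.DOUBLE_INDIRECT:
            # Two levels of indirection before the data is reached.
            return self.cells[self.cells[self.cells[operand]]]
        raise ValueError(f"unknown addressing mode: {mode}")


def execute_reads(stores, instruction):
    """Apply one instruction that carries a separate (mode, operand) field per store."""
    if len(instruction) != len(stores):
        raise ValueError("instruction needs exactly one field per data store")
    return [store.read(mode, operand)
            for store, (mode, operand) in zip(stores, instruction)]


if __name__ == "__main__":
    stores = [DataStore([10, 0, 1, 2]),
              DataStore([3, 20, 30, 1]),
              DataStore([1, 2, 3, 99])]
    instruction = [(Mode.ABSOLUTE, 0), (Mode.INDIRECT, 0), (Mode.DOUBLE_INDIRECT, 0)]
    print(execute_reads(stores, instruction))
```

The point of the sketch is that the mode is a field of the instruction rather than a property of the store, so a single instruction can read one store absolutely and another indirectly in the same cycle.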
In another embodiment, there is provided a multiple processor apparatus for processing computer graphics requests in which the control store is accessed at an increased clock rate in relation to the clock rate of the processors. In this embodiment, the apparatus has:
1. a plurality n of processors, n greater than 1, each processor running at a processor clock rate R; and
2. a single control store supplying instructions for the processors, running at a store clock rate nR.
In a related embodiment, the processors are responsive to instructions, and each instruction is a wide word or (in a yet further embodiment) a super wide word. Another related embodiment also has a control store sequencer, for evaluating branch instructions, at a clock rate nR, so that each processor may be caused to branch without processor clock delay for evaluation of branch instructions.
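The timing relationship between the processors and the shared control store can be illustrated with a small Python simulation. It is a sketch under stated assumptions only: the function name `simulate_shared_control_store` is hypothetical, and round-robin service of the processors is one plausible way to share a store running at nR that the source does not confirm. Branch evaluation by the sequencer is not modeled.

```python
def simulate_shared_control_store(programs, processor_cycles):
    """Interleave instruction fetches for n processors from one control store.

    programs: one instruction list per processor, each holding at least
    processor_cycles instructions (all conceptually kept in a single store).
    Returns (store_tick, processor_id, instruction) tuples.
    """
    n = len(programs)
    program_counters = [0] * n
    schedule = []
    # The store runs n ticks for every processor cycle (rate nR versus R).
    for store_tick in range(n * processor_cycles):
        processor_id = store_tick % n  # assumed round-robin service
        pc = program_counters[processor_id]
        schedule.append((store_tick, processor_id, programs[processor_id][pc]))
        program_counters[processor_id] = pc + 1
    return schedule


if __name__ == "__main__":
    programs = [["a0", "a1"], ["b0", "b1"]]
    for entry in simulate_shared_control_store(programs, 2):
        print(entry)
```

In this model each processor receives exactly one instruction per processor clock cycle even though only one control store exists, which is the property that running the store at nR is intended to provide.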
Another embodiment of the invention provides an apparatus, for processing computer graphics requests, that uses a stack for storing instruction addresses arranged so as not to produce an overflow condition. This embodiment has
1. a graphics request input;
2. a processor, coupled to the graphics request input, and having an output and responsive to a set of instructions, the set including a call and a return;
3. a stack of n entries for storing instruction addresses, the stack having a top entry;
4. a program counter for addressing instructions and having a value for a current instruction, wherein
i. each time a call is invoked, a number equal to one more than the value of the program counter is pushed onto the top of the stack and
ii. each time a return is invoked, the top entry of the stack is popped off the stack and placed in the program counter;
5. wherein program execution is maintained even when the excess of calls over returns is greater than n, so that current entries in the stack may be abandoned by invoking an instruction stream that is independent of return addresses in the stack. In a further related embodiment, entries in the stack are addressed by a LIFO system.
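One way to obtain a stack that never produces an overflow condition is to let the stack pointer wrap around, so that a call made when all n entries are in use simply overwrites the oldest return address. The following Python sketch illustrates that behavior. The wrap-around arrangement and the names `WrapAroundReturnStack`, `call`, and `ret` are assumptions made for this example; the source states only that no overflow condition is produced and that stale entries may be abandoned.

```python
class WrapAroundReturnStack:
    """An n-entry return-address stack whose pointer wraps instead of overflowing."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("the stack needs at least one entry")
        self.entries = [0] * n
        self.top = 0  # index of the next free slot, kept modulo n

    def call(self, program_counter, target):
        """Push program_counter + 1 and return the new program counter value."""
        self.entries[self.top] = program_counter + 1
        self.top = (self.top + 1) % len(self.entries)
        return target

    def ret(self):
        """Pop the top entry and return it as the new program counter value."""
        self.top = (self.top - 1) % len(self.entries)
        return self.entries[self.top]


if __name__ == "__main__":
    stack = WrapAroundReturnStack(2)
    pc = stack.call(10, 100)   # pushes 11
    pc = stack.call(pc, 200)   # pushes 101
    pc = stack.call(pc, 300)   # pushes 201, overwriting the entry that held 11
    print(stack.ret(), stack.ret())
```

After the third call the return address 11 is no longer recoverable, which is acceptable when, as described above, the instruction stream that follows does not depend on the abandoned return addresses.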
In accord with another aspect of the invention, a graphics accelerator includes a vertex input for receiving vertex data, an output for forwarding processed data, and a processor coupled with the vertex input and output. The graphics accelerator also includes an instruction input that receives instructions for processing the vertex data received from the vertex input. The processor is responsive to wide word instructions.
In accordance with yet another aspect of the invention, a graphics accelerator includes a vertex input for receiving vertex data, a processor coupled with the input, and a set of registers for storing results produced by the processor. To that end, the processor includes a plurality of functional units that execute based upon a clock cycle. The plurality of functional units produce n results per each clock cycle. The set of registers includes no more than n/2 registers.
In preferred embodiments, n is greater than 1. Moreover, the graphics accelerator is responsive to a set of instructions that may include one of a plurality of wide words, super wide words, and ultra wide words.