Computer Graphics and Rendering
Modern computer systems normally manipulate graphical objects as high-level entities. For example, a solid body may be described as a collection of triangles with specified vertices, or a straight line segment may be described by listing its two endpoints with three-dimensional or two-dimensional coordinates. Such high-level descriptions are a necessary basis for high-level geometric manipulations, and also have the advantage of providing a compact format which does not consume memory space unnecessarily.
Such higher-level representations are very convenient for performing the many required computations. For example, ray-tracing or other lighting calculations may be performed, and a projective transformation can be used to reduce a three-dimensional scene to its two-dimensional appearance from a given viewpoint. However, when an image containing graphical objects is to be displayed, a very low-level description is needed. For example, in a conventional CRT display, a "flying spot" is moved across the screen (one line at a time), and the beam from each of three electron guns is switched to a desired level of intensity as the flying spot passes each pixel location. Thus at some point the image model must be translated into a data set which can be used by a conventional display. This operation is known as "rendering." In general, this application refers to rendering as the processes which include rasterization and following steps.
The graphics-processing system typically interfaces to the display controller through a "frame store" or "frame buffer" of special two-port memory, which can be written to randomly by the graphics processing system, but also provides the synchronous data output needed by the video output driver. (Digital-to-analog conversion is also provided after the frame buffer.) Such a frame buffer is usually implemented using SDRAM or SGRAM memory chips (or sometimes with VRAM or DRAM and special controllers). This interface relieves the graphics-processing system of most of the burden of synchronization for video output. Nevertheless, the amounts of data which must be moved around are very sizable, and the computational and data-transfer burden of placing the correct data into the frame buffer can still be very large.
Even if the computational operations required are quite simple, they must be performed repeatedly on a large number of datapoints. For example, in a typical 1998 high-end configuration, a display of 1280×1024 elements may need to be refreshed at 72 Hz, with a color resolution of 24 or more bits per pixel. If blending is desired, additional bits (e.g. another 8 bits per pixel) will be required to store an "alpha" or transparency value for each pixel. This implies manipulation of more than 3 billion bits per second, without allowing for any of the actual computations being performed. Thus it may be seen that this is an environment with unique data manipulation requirements.
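By way of illustration only, the arithmetic behind the figure quoted above can be checked with a few lines of Python. The function name is hypothetical; the display size, refresh rate and pixel depths are those given in the preceding paragraph.

```python
def refresh_bits_per_second(width, height, refresh_hz, bits_per_pixel):
    """Raw bits that must be delivered each second just to refresh the display."""
    return width * height * refresh_hz * bits_per_pixel

# 24-bit color only, then 24-bit color plus an 8-bit alpha value per pixel.
color_only = refresh_bits_per_second(1280, 1024, 72, 24)
with_alpha = refresh_bits_per_second(1280, 1024, 72, 24 + 8)

print(f"24 bits per pixel: {color_only:,} bits/s")
print(f"32 bits per pixel: {with_alpha:,} bits/s")
```

With 32 bits per pixel the product is 3,019,898,880 bits per second, which is the "more than 3 billion" figure; with 24 bits per pixel it is about 2.26 billion.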
If the display is unchanging, no demand is placed on the rendering operations. However, some common operations (such as zooming or rotation) will require every object in the image space to be re-rendered. Slow rendering will make the rotation or zoom appear jerky. This is highly undesirable. Thus efficient rendering is an essential step in translating an image representation into the correct pixel values. This is particularly true in animation applications, where newly rendered updates to a computer graphics display must be generated at regular intervals.
The rendering requirements of three-dimensional graphics are particularly heavy. One reason for this is that, even after the three-dimensional model has been translated to a two-dimensional model, some computational tasks may be bequeathed to the rendering process. (For example, color values will need to be interpolated across a triangle or other primitive.) These computational tasks tend to burden the rendering process. Another reason is that since three-dimensional graphics are much more lifelike, users are more likely to demand a fully rendered image. (By contrast, in the two-dimensional images created e.g. by a GUI or simple game, users will learn not to expect all areas of the scene to be active or filled with information.)
FIG. 2 is a very high-level view of other processes performed in a 3D graphics computer system. A three dimensional image which is defined in some fixed 3D coordinate system (a "world" coordinate system) is transformed into a viewing volume (determined by a view position and direction), and the parts of the image which fall outside the viewing volume are discarded. The visible portion of the image volume is then projected onto a viewing plane, in accordance with the familiar rules of perspective. This produces a two-dimensional image, which is now mapped into device coordinates. It is important to understand that all of these operations occur prior to the operations performed by the rendering subsystem of the present invention. FIG. 3 is an expanded version of FIG. 2, and shows the flow of operations defined by the OpenGL standard.
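The sequence just described (world coordinates, viewing volume, perspective projection, device coordinates) may be sketched for a single point as follows. This is a minimal sketch under stated assumptions, not the flow of FIG. 2 itself: the function name and parameters are hypothetical, the viewer looks along +z with no rotation, the viewing plane lies at distance `focal`, and the visible window on that plane spans [-1, 1] in both directions.

```python
def world_to_device(point, eye, focal, width, height):
    """Map one world-space point to integer device coordinates, or None if discarded."""
    # World -> view coordinates (translation only in this simplified sketch).
    x, y, z = (p - e for p, e in zip(point, eye))
    if z <= 0:
        return None  # behind the viewer, so outside the viewing volume
    # Perspective projection onto the viewing plane.
    px, py = focal * x / z, focal * y / z
    if abs(px) > 1 or abs(py) > 1:
        return None  # falls outside the viewing volume and is discarded
    # Viewing plane -> device coordinates, with device y increasing downward.
    dx = (px + 1) / 2 * (width - 1)
    dy = (1 - py) / 2 * (height - 1)
    return round(dx), round(dy)

center = world_to_device((0, 0, 5), (0, 0, 0), 1, 641, 481)
corner = world_to_device((1, 1, 1), (0, 0, 0), 1, 641, 481)
```

A point on the viewing axis lands at the center of the device window, and a point at the upper right limit of the viewing volume lands at the upper right device pixel.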
A vast amount of engineering effort has been invested in computer graphics systems, and this area is one of increasing activity and demands. Numerous books have discussed the requirements of this area; see, e.g., ADVANCES IN COMPUTER GRAPHICS (ed. Enderle 1990-); Chellappa and Sawchuk, DIGITAL IMAGE PROCESSING AND ANALYSIS (1985); COMPUTER GRAPHICS HARDWARE (ed. Reghbati and Lee 1988); COMPUTER GRAPHICS: IMAGE SYNTHESIS (ed. Joy et al.); Foley et al., FUNDAMENTALS OF INTERACTIVE COMPUTER GRAPHICS (2.ed. 1984); Foley, COMPUTER GRAPHICS PRINCIPLES and PRACTICE (2.ed. 1990); Foley, INTRODUCTION TO COMPUTER GRAPHICS (1994); Giloi, Interactive Computer Graphics (1978); Hearn and Baker, COMPUTER GRAPHICS (2.ed. 1994); Hill, COMPUTER GRAPHICS (1990); Latham, DICTIONARY OF COMPUTER GRAPHICS (1991); Magnenat-Thalma, IMAGE SYNTHESIS THEORY and PRACTICE (1988); Newman and Sproull, PRINCIPLES OF INTERACTIVE COMPUTER GRAPHICS (2.ed. 1979); PICTURE ENGINEERING (ed. Fu and Kunii 1982); PICTURE PROCESSING and DIGITAL FILTERING (2.ed. Huang 1979); Prosise, HOW COMPUTER GRAPHICS WORK (1994); Rimmer, BIT MAPPED GRAPHICS (2.ed. 1993); Salmon, COMPUTER GRAPHICS SYSTEMS and CONCEPTS (1987); Schachter, COMPUTER IMAGE GENERATION (1990); Watt, THREE-DIMENSIONAL COMPUTER GRAPHICS (2.ed. 1994); Scott Whitman, MULTIPROCESSOR METHODS FOR COMPUTER GRAPHICS RENDERING; the SIGGRAPH PROCEEDINGS for the years 1980-1994; and the IEEE Computer Graphics and Applications magazine for the years 1990-1997; all of which are hereby incorporated by reference.
Background: Graphics Animation
In many areas of computer graphics a succession of slowly changing pictures is displayed rapidly, one after the other, to give the impression of smooth movement, in much the same way as for cartoon animation. In general, the higher the speed of the animation, the smoother (and better) the result.
When an application is generating animation images, it is normally necessary not only to draw each picture into the frame buffer, but also to first clear down the frame buffer, and to clear down auxiliary buffers such as depth (Z) buffers, stencil buffers, alpha buffers and others. A good treatment of the general principles may be found in Computer Graphics: Principles and Practice, James D. Foley et al., Reading MA: Addison-Wesley. A specific description of the various auxiliary buffers may be found in The OpenGL Graphics System: A Specification (Version 1.0), Mark Segal and Kurt Akeley, SGI.
In most applications the value written, when clearing any given buffer, is the same at every pixel location, though different values may be used in different auxiliary buffers. Thus the frame buffer is often cleared to the value which corresponds to black, while the depth (Z) buffer is typically cleared to a value corresponding to infinity.
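The clearing step may be illustrated with a short sketch. The function name and the list-of-lists buffer layout are assumptions made for illustration; the clear values (black for the color buffer, infinity for the depth buffer) follow the paragraph above.

```python
def clear_buffers(width, height, color_clear=0, depth_clear=float("inf")):
    """Return freshly cleared (color, depth) buffers as row-major lists of rows."""
    # Every pixel location in a given buffer receives the same value.
    color = [[color_clear] * width for _ in range(height)]
    depth = [[depth_clear] * width for _ in range(height)]
    return color, depth

color, depth = clear_buffers(4, 3)
```

Each buffer requires one write per pixel location, so the cost of clearing scales with the number of pixels multiplied by the number of buffers.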
The time taken to clear down the buffers is often a significant portion of the total time taken to draw a frame, so it is important to minimize it.
Background: Parallelism in Graphics Processing
Due to the large number of at least partially independent operations which are performed in rendering, many proposals have been made to use some form of parallel architecture for graphics (and particularly for rendering). See, for example, the special issue of Computer Graphics on parallel rendering (September 1994). Other approaches may be found in earlier patent filings by the assignee of the present application and its predecessors, e.g. U.S. Pat. No. 5,195,186, and published PCT applications PCT/GB90/00987, PCT/GB90/01209, PCT/GB90/01210, PCT/GB90/01212, PCT/GB90/01213, PCT/GB90/01214, PCT/GB90/01215, and PCT/GB90/01216, all of which are hereby incorporated by reference.
Background: Pipelined Processing Generally
There are several general approaches to parallel processing. One of the basic approaches to achieving parallelism in computer processing is a technique known as pipelining. In this technique the individual processors are, in effect, connected in series in an assembly-line configuration: one processor performs a first set of operations on one chunk of data, and then passes that chunk along to another processor which performs a second set of operations, while at the same time the first processor performs the first set of operations again on another chunk of data. Such architectures are generally discussed in Kogge, THE ARCHITECTURE OF PIPELINED COMPUTERS (1981), which is hereby incorporated by reference.
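The assembly-line behavior described above can be simulated tick by tick. The following is a minimal sketch; the function `run_pipeline` and the two example stages are hypothetical, and the value `None` is reserved to mean "no chunk present," so chunks themselves must not be `None`.

```python
def run_pipeline(stages, chunks):
    """Simulate a linear pipeline; return (results, ticks taken)."""
    inputs = [None] * len(stages)  # inputs[i]: chunk waiting at stage i this tick
    queue = list(chunks)
    results, ticks = [], 0
    while queue or any(c is not None for c in inputs):
        # Feed the first stage with the next chunk, if any remain.
        inputs[0] = queue.pop(0) if queue else None
        # All stages operate during the same tick, each on its own chunk.
        outputs = [stage(c) if c is not None else None
                   for stage, c in zip(stages, inputs)]
        if outputs[-1] is not None:
            results.append(outputs[-1])
        # Each stage's output becomes the next stage's input on the next tick.
        inputs = [None] + outputs[:-1]
        ticks += 1
    return results, ticks

results, ticks = run_pipeline([lambda v: v + 1, lambda v: v * 2], [1, 2, 3])
```

Three chunks pass through two stages in four ticks, whereas performing the same six stage operations strictly one after another would take six.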
Background: The OpenGL™ Standard
The "OpenGL" standard is a very important software standard for graphics applications. In any computer system which supports this standard, the operating system(s) and application software programs can make calls according to the OpenGL standards, without knowing exactly what the hardware configuration of the system is.
The OpenGL standard provides a complete library of low-level graphics manipulation commands, which can be used to implement three-dimensional graphics operations. This standard was originally based on the proprietary standards of Silicon Graphics, Inc., but was later transformed into an open standard. It is now becoming extremely important, not only in high-end graphics-intensive workstations, but also in high-end PCs. OpenGL is supported by Windows NT™, which makes it accessible to many PC applications.
The OpenGL specification provides some constraints on the sequence of operations. For instance, the color DDA operations must be performed before the texturing operations, which must be performed before the alpha operations. (A "DDA," or digital differential analyzer, is a conventional piece of hardware used to produce linear gradation of color (or other) values over an image area.)
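The linear gradation produced by a DDA can be sketched in software as repeated addition of a fixed increment. This is an illustrative sketch only: the function name is hypothetical, and floating-point values are used here for brevity, whereas hardware implementations would more typically use fixed-point arithmetic (an assumption, not a statement from this specification).

```python
def dda(start, end, steps):
    """Return steps + 1 values running linearly from start to end."""
    delta = (end - start) / steps  # the increment is computed once
    value = start
    values = []
    for _ in range(steps + 1):
        values.append(value)
        value += delta  # each further value costs only one addition
    return values

ramp = dda(0, 255, 5)
```

The point of the technique is that, once the increment has been set up, each successive value requires a single addition rather than a multiplication or division.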
Other graphics interfaces (or "APIs"), such as PHIGS or XGL, are also current as of 1995; but at the lowest level, OpenGL is a superset of most of these.
The OpenGL standard is described in the OPENGL PROGRAMMING GUIDE (1993), the OPENGL REFERENCE MANUAL (1993), and a book by Segal and Akeley (of SGI) entitled THE OPENGL GRAPHICS SYSTEM: A SPECIFICATION (Version 1.0), all of which are hereby incorporated by reference.
FIG. 3 is an expanded version of FIG. 2, and shows the flow of operations defined by the OpenGL standard. Note that the most basic model is carried in terms of vertices, and these vertices are then assembled into primitives (such as triangles, lines, etc.). After all manipulation of the primitives has been completed, the rendering operations will translate each primitive into a set of "fragments." (A fragment is the portion of a primitive which affects a single pixel.) Again, it should be noted that all operations above the block marked "Rasterization" would be performed by a host processor, or possibly by a "geometry engine" (i.e. a dedicated processor which performs rapid matrix multiplies and related data manipulations), but would normally not be performed by a dedicated rendering processor such as that of the presently preferred embodiment.
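The translation of a primitive into fragments may be illustrated for a triangle as follows. This sketch uses the well-known edge-function test on pixel centers; it is offered as one common software technique and is not asserted to be the method used by OpenGL implementations or by the preferred embodiment. The function names are hypothetical, and pixels whose centers lie exactly on an edge are included, whereas practical rasterizers apply a fill rule so that shared edges are not drawn twice.

```python
import math

def edge(a, b, p):
    """Signed term: positive on one side of the edge a->b, negative on the other."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize_triangle(v0, v1, v2):
    """Return the (x, y) pixels whose centers fall inside the triangle."""
    area = edge(v0, v1, v2)
    if area == 0:
        return []  # a degenerate triangle covers no pixels
    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    fragments = []
    # Visit every pixel in the bounding box of the triangle.
    for y in range(math.floor(min(ys)), math.ceil(max(ys)) + 1):
        for x in range(math.floor(min(xs)), math.ceil(max(xs)) + 1):
            center = (x + 0.5, y + 0.5)
            terms = (edge(v1, v2, center), edge(v2, v0, center), edge(v0, v1, center))
            # Inside when every term has the same sign as the triangle's area.
            if all(t * area >= 0 for t in terms):
                fragments.append((x, y))
    return fragments

fragments = rasterize_triangle((0, 0), (4, 0), (0, 4))
```

Each returned pair corresponds to one fragment, i.e. the portion of the triangle affecting a single pixel.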
Background: Bit-Blit
Bit-blit, also written as bit blit and bitblt, is a pixel block copying procedure. The term "bitblt" is short for "bit block transfer." One of the most common uses of the bit-blit is in copying pixels from the back framebuffer, where they were written by the graphics processor, to the front framebuffer, from where they will be scanned and displayed. Blitting is also used to simply move a block of pixels from one set of memory locations to another, which effectively moves those pixels on the display, e.g. scrolling of text or moving a window on the screen.
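A block copy of this kind may be sketched as follows. The function name and the list-of-rows buffer layout are assumptions made for illustration. The source block is copied out in full before anything is written, so that the sketch also behaves correctly when source and destination overlap within the same buffer (as in scrolling).

```python
def bitblt(src, sx, sy, dst, dx, dy, width, height):
    """Copy a width x height block from src at (sx, sy) to dst at (dx, dy)."""
    # Snapshot the source block so overlapping regions are handled correctly.
    block = [row[sx:sx + width] for row in src[sy:sy + height]]
    for j, row in enumerate(block):
        dst[dy + j][dx:dx + width] = row

framebuffer = [[0, 1, 2, 3],
               [4, 5, 6, 7],
               [8, 9, 10, 11]]
# Move the top-left 2x2 block one pixel right and one pixel down.
bitblt(framebuffer, 0, 0, framebuffer, 1, 1, 2, 2)
```

Passing two different buffers as `src` and `dst` corresponds to the back-to-front framebuffer copy mentioned above.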
Background: Multiple High-Performance Graphics Processors
One method of increasing graphics throughput is to combine multiple graphics processors in one system, and to distribute the graphics processing between them. One common method of distributing graphics jobs between processors is to assign alternating scanlines (or alternating multiple-scanline "stripes") to each processor, and then to read each framebuffer in turn in order to display the resulting data.
One problem with this sort of multiprocessor system arises when a bit blit is sought to be performed on a set of pixels which encompass more than one scanline. Because scanline boundaries are the common divisions between processors, this means that the blit operation may require the memories of multiple processors to be read from or written to. In particular, to maintain the ability to perform a logical operation concurrently with a bit-blit, both the source and destination memories must be read, and the destination memory must be then written to.
Innovative System and Preferred System Context
The present invention provides a new approach to these needs. In the preferred embodiment, each of the multiple processors is able to fully support multi-processor operation using only PCI read operations between processors. According to the innovative method described more fully below, each of the graphics processors performs its operations on its respective scanlines, and writes to its own framebuffer, but the need for writes from one processor to the framebuffer of another processor is eliminated.
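The general idea of a read-only multi-processor blit can be illustrated with a small simulation. This is a sketch under stated assumptions and is not the method of the preferred embodiment, which is described elsewhere in this application: the names `Processor`, `owner` and `multi_blit` are hypothetical, scanlines are assigned to processors by simple alternation, and all source reads are completed before any write so that ordering hazards between processors are set aside.

```python
import operator

class Processor:
    """One graphics processor holding only the scanlines it owns."""
    def __init__(self, index, count, width, height):
        self.rows = {y: [0] * width for y in range(height) if y % count == index}
        self.remote_reads = 0  # reads from another processor's memory

def owner(processors, y):
    """Alternating-scanline assignment: scanline y belongs to processor y mod N."""
    return processors[y % len(processors)]

def multi_blit(processors, sx, sy, dx, dy, width, height, op=lambda s, d: s):
    # Phase 1: the owner of each destination scanline reads the source span,
    # from its own memory or (as a remote read) from another processor's memory.
    fetched = []
    for j in range(height):
        proc = owner(processors, dy + j)
        source = owner(processors, sy + j)
        if source is not proc:
            proc.remote_reads += 1
        fetched.append((proc, dy + j, source.rows[sy + j][sx:sx + width]))
    # Phase 2: each processor reads its own destination pixels, applies the
    # logical operation, and writes only to its own memory.
    for proc, y, src in fetched:
        dst = proc.rows[y][dx:dx + width]
        proc.rows[y][dx:dx + width] = [op(s, d) for s, d in zip(src, dst)]

procs = [Processor(i, 2, 4, 4) for i in range(2)]
for y in range(4):
    owner(procs, y).rows[y] = [y + 1] * 4
# Blit scanlines 0-1 onto scanlines 1-2, XORing source with destination.
multi_blit(procs, 0, 0, 0, 1, 4, 2, op=operator.xor)
```

In this sketch every write lands in the memory of the processor that owns the destination scanline, so traffic between processors consists of reads only.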
This innovative multi-processor bit-blit system is presented, in the presently preferred embodiment, in the context of multiple 3Dlabs GLINT(copyright) pipelined graphics processors, many details of which may be found in other issued 3Dlabs patents, e.g. U.S. Pat. Nos. 5,701,111, 5,272,192, 5,594,854, 5,777,629, 5,798,770, 5,764,243, all of which are hereby incorporated by reference. The preferred embodiment provides a graphics processing chip which uses a deep pipeline of multiple asynchronous units, separated by FIFOs, to achieve a high net throughput in 3D rendering. Besides the output interface to the frame buffer, a separate interface is provided to a local buffer which can be used for data manipulation (such as Z-buffering). Preferably reads and writes to the local buffer are provided by separate stages of the pipeline. Preferably some of the individual units include parallel paths internally. Preferably some of the individual units are connected to look ahead by more than one stage, to keep the pipeline filled while minimizing the use of expensive deep FIFOs.
The graphics management chip provided by the presently preferred embodiment implements the low-level rasterizing functions of OpenGL, together with some additional functions which aid in management of two-dimensional rendering to serve the graphical user interface.
The message-passing architecture of the presently preferred embodiment provides a long pipeline, in which the individual stages of the pipeline operate asynchronously. To optimize performance, stages of the pipeline may have internally parallel structure. (However, this is a basically quite different processing paradigm from the parallel rendering environments being explored by other developers.)
Where possible, data is kept on chip (registered) between blocks. However, of course, memory access is sometimes necessary. Thus, although most of the blocks are two-port blocks, some are multi-port to permit memory access. FIFO buffering is typically used for interface between the blocks. In many cases, one-deep FIFOs can be used, with appropriate look-ahead connections for timing control. However, in other stages, significantly deeper FIFOs are used, to avoid "bubbles" in the pipeline and optimize processor utilization.
The overall architecture of this innovative chip is best viewed using the software paradigm of a message passing system. In this system all the processing blocks are connected in a long pipeline with communication with the adjacent blocks being done through message passing. Between each block there is a small amount of buffering, the size being specific to the local communications requirements and speed of the two blocks.
The message rate is variable and depends on the rendering mode. The messages do not propagate through the system at the fixed rate typical of a more traditional pipeline system. If the receiving block cannot accept a message because its input buffer is full, then the sending block stalls until space is available.
The message structure is fundamental to the whole system as the messages are used to control, synchronize and inform each block about the processing it is to undertake. Each message has two fields: a data field and a tag field. The data field will hold color information, coordinate information, local state information, etc. The tag field is used by each block to identify the message type so it knows how to act on it.
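The message-passing arrangement described above (tagged messages, a small input buffer per block, and a sender that stalls when the receiver's buffer is full) may be modeled in software as follows. This is a minimal sketch: the names `Message`, `Block` and `run` are hypothetical, and the behavior that a block forwards unchanged any message whose tag it does not handle is an assumption made for the sketch.

```python
from collections import deque, namedtuple

Message = namedtuple("Message", ["tag", "data"])

class Block:
    """One pipeline block with a bounded input buffer."""
    def __init__(self, handlers, capacity=1):
        self.handlers = handlers  # maps a tag to a function of the data field
        self.capacity = capacity
        self.inbox = deque()
        self.stalls = 0

    def can_accept(self):
        return len(self.inbox) < self.capacity

    def step(self, downstream, sink):
        """Process at most one message; stall if the receiver's buffer is full."""
        if not self.inbox:
            return
        if downstream is not None and not downstream.can_accept():
            self.stalls += 1  # hold the message until space is available
            return
        message = self.inbox.popleft()
        handler = self.handlers.get(message.tag)  # the tag selects the action
        if handler is not None:
            message = Message(message.tag, handler(message.data))
        if downstream is None:
            sink.append(message)
        else:
            downstream.inbox.append(message)

def run(blocks, messages):
    """Push messages through the chain of blocks and collect the output."""
    pending, sink = deque(messages), []
    while pending or any(block.inbox for block in blocks):
        if pending and blocks[0].can_accept():
            blocks[0].inbox.append(pending.popleft())
        for i, block in enumerate(blocks):
            nxt = blocks[i + 1] if i + 1 < len(blocks) else None
            block.step(nxt, sink)
    return sink

scale = Block({"color": lambda c: c * 2})
offset = Block({"coord": lambda p: p + 1})
output = run([scale, offset],
             [Message("color", 10), Message("coord", 5), Message("sync", None)])
```

Each block acts only on the tags it recognizes, and message order is preserved from one end of the chain to the other.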
A particular advantage of this architecture is that it inherently provides a very high degree of design for testability. Moreover, this is achieved without adding any special diagnostic hardware paths or registers. By providing appropriate commands to the chip, any desired input can be sent to any block within the pipeline. Thus modifications to the architecture can be tested very rapidly, and debugging can rapidly pinpoint any faults which may be present.
A further advantage of this architecture is that it permits a very efficient test strategy: each unit can be taken out of the message stream and tested in isolation. This is possible because the interactions are all through the messages, and each unit does not know or care where the messages come from. Thus testing software can generate streams of messages as stimulus, and can check the resulting messages coming out against what the specified behavioral model defines. The input and output timings are varied to force the internal states to run in blocking or non-blocking modes, to further increase the test coverage. Moreover, the test coverage can be ascertained (both at the C statement level in the simulator and at the VHDL level), so that the comprehensiveness of the tests is not an unknown.
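The test strategy just described may be sketched as a small harness that feeds stimulus messages to a single unit and compares each result with a behavioral model. Everything named here (`run_unit_test`, `unit`, `model`, and the tuple form of a message) is hypothetical and chosen only for illustration; variation of input and output timings is not modeled.

```python
def run_unit_test(unit, model, stimulus):
    """Return a list of (message, got, expected) for every mismatch."""
    failures = []
    for message in stimulus:
        got, expected = unit(message), model(message)
        if got != expected:
            failures.append((message, got, expected))
    return failures

# Hypothetical unit under test: doubles the data of "color" messages
# and passes all other messages through unchanged.
def unit(message):
    tag, data = message
    return (tag, data * 2) if tag == "color" else message

# Behavioral model written independently of the unit.
def model(message):
    tag, data = message
    return (tag, data + data) if tag == "color" else (tag, data)

stimulus = [("color", 3), ("coord", 7), ("color", 0)]
failures = run_unit_test(unit, model, stimulus)
```

Because the unit sees only messages, the same harness can be applied to any unit in the stream without knowledge of its neighbors.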