Computer graphics systems are frequently used to model a scene having three-dimensional objects and display the scene on a two-dimensional display device (31) such as a cathode ray tube or liquid crystal display. Typically, the three-dimensional objects of the scene are each represented by a multitude of polygons (or primitives) that approximate the shape of the object. Rendering the scene for display on the two-dimensional display device is a computationally intensive process. It is therefore frequently a slow process, even with today's microprocessors and graphics processing devices.
Referring to FIG. 1, a multiprocessor is a machine containing more than one data processor (e.g., host processor (32) and geometry processor (38)). The processors may be connected to each other by a bus or by a cross bar switch. Each of the processors may have an associated cache memory. The host processor and the geometry processor share a common system memory (33) through the bus or cross bar switch and the associated cache (if provided). Each processor may also have a private or local memory (36) that is not accessible to the other processors.
Each of the processors of the multiprocessor may execute an associated task. For example, an audio application or task may run on one processor while a video application may run on another processor. In this case each processor executes its task in a substantially independent manner without any strong interaction between the tasks running on the other processors.
In other cases, of most interest to this invention, a single task is partitioned into sub-tasks that are then executed cooperatively on two or more processors by assigning one processor to one sub-task. When several processors cooperate in this manner to execute a single task, they typically need to share, in a fair manner, common resources such as memory, as well as buffers, printers, and other peripherals. In addition, the processors typically need to communicate with one another so as to share information needed at checkpoints, to wait for other processors to complete a certain routine, to signal to other processors that the processor is done with its assigned sub-task, etc.
A master-slave system is the simplest implemented system that supports multiprocessing within a single job. In this master-slave system, a predefined processor is declared the master and is permitted to execute the operating system. The other processors, denoted slaves, may execute only user applications. In practice a master-slave system permits the slaves to perform some easily parallelized operating system functions. Unlike separate supervisors, a master-slave system permits true parallelism within a single job, but only for user applications. The operating system itself is essentially serial, with all but the most trivial functions executed on the unique master processor. For a modest number of processors and a computationally heavy work load, parallelizing the user's applications may be adequate, and the master-slave system has the advantage of simplicity over the more ambitious symmetric systems.
The generic 3D graphics system as shown in FIG. 1 consists of a host processor (32) and a graphics adapter (30) as a master-slave system with the host processor being the master and the graphics adapter being the slave. For most applications, part of the application runs on the host processor, and the rest runs on the graphics adapter. FIG. 2 shows the processing steps for rendering an object in the 3D graphics system shown in FIG. 1. The host creates work items and feeds them to the graphics adapter. With existing schemes the host processor waits for the graphics adapter (14) to return a status for the current work item before proceeding with the next work item (11) in the application. In such a master-slave system, the application goes through the scene database and sends a set of objects (11) to be rendered, along with their material properties and attributes (12) such as location and rasterization parameters, to the graphics rendering system. See [Foley], which is incorporated herein by reference. Material properties could include the shininess, emissivity, ambient color, diffuse color, specular color, and texture of the model. This work is done on the host processor, which is the master, and corresponds to the first three boxes in the left column (11, 12, 13) in FIG. 2. The graphics rendering is performed on the graphics adapter, which is the slave, and this rendering corresponds to the work listed in all the boxes in the middle and right columns in FIG. 2. The graphics adapter receives objects from the host (15) and does the following. It first transforms the position of the objects from model coordinates into normalized device coordinates (NDC) (16). Referring to FIG. 3, the view volume (31), defined in NDC, determines which portion of space is visible to the viewer. Objects falling outside the view volume (41) are discarded from further processing (trivial rejection). 
Objects entirely contained in the view volume (42) remain unchanged and are sent to the lighting stage (trivial acceptance). Objects that intersect the boundary of the view volume (43, 44) are clipped against the view volume, i.e., split into a portion inside and a portion outside the view volume; the inside portion is then sent to the lighting stage (19). FIG. 3 shows the trivial accept (42), trivial reject (41), and clipped cases (43, 44). The application sending the objects often can make use of the trivial accept/reject/clip (status) information (21 of FIG. 2) for the object to
1. decide whether other objects need to be sent for the current frame by using hierarchical geometric models or by using inter-object visibility information,
2. decide whether this object needs to be sent for the next frame by using frame to frame coherence,
3. determine whether the object needs to be lighted, and
4. determine the level of tessellation required for the current or next frame.
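The trivial accept, trivial reject, and clipped cases of FIG. 3 can be illustrated with a minimal Python sketch. It assumes, for illustration only, that the view volume is the cube from -1 to 1 on each axis in NDC and that each object is summarized by an axis-aligned bounding box, so the "clip" result is conservative (the box straddles the boundary, and the object may need clipping). The names ClipStatus and classify_box are hypothetical and do not belong to any graphics API.

```python
from enum import Enum


class ClipStatus(Enum):
    TRIVIAL_ACCEPT = "accept"  # entirely inside the view volume
    TRIVIAL_REJECT = "reject"  # entirely outside the view volume
    NEEDS_CLIP = "clip"        # straddles the view-volume boundary


def classify_box(box_min, box_max, lo=-1.0, hi=1.0):
    """Classify an axis-aligned box against the cube [lo, hi]^3.

    The cube stands in for the NDC view volume (an assumption of this sketch).
    box_min and box_max are (x, y, z) tuples.
    """
    axes = list(zip(box_min, box_max))
    # Reject if, on any axis, the box lies wholly beyond one face of the cube.
    if any(mx < lo or mn > hi for mn, mx in axes):
        return ClipStatus.TRIVIAL_REJECT
    # Accept if, on every axis, the box lies within the cube.
    if all(mn >= lo and mx <= hi for mn, mx in axes):
        return ClipStatus.TRIVIAL_ACCEPT
    # Otherwise the box crosses at least one face of the cube.
    return ClipStatus.NEEDS_CLIP
```

For example, a box spanning (-0.5, -0.5, -0.5) to (0.5, 0.5, 0.5) is accepted trivially, a box spanning (2, 2, 2) to (3, 3, 3) is rejected trivially, and a box spanning (0.5, 0, 0) to (1.5, 0.5, 0.5) is reported as needing clipping.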
The status returned could also be used by the host processor in several ways. For example, if previous data was rejected trivially, the application may know that current data is also likely to be rejected because of the way the model is stored. It may then decide to tessellate the current data (i.e., break up the geometric model into triangles) at a coarser level, i.e., with fewer triangles, and thereby speed up overall processing. The other rationale for this coarser tessellation is that, even if part of the current data is accepted (not rejected trivially), the triangles for the current data are likely to be near the corner of the viewing region, and the system may render such models with less detail, i.e., with fewer triangles.
Similarly, a status of trivial accept could signal to the application that objects have to be drawn with more detail. For example, an application could send down a teapot with 100 triangles, and if the return status says that this teapot was accepted trivially, it can send down the same teapot with 5000 triangles. The teapot sent last would draw on top of the previous teapot. The ratio of accepts to rejects may also help the application to restructure the processing algorithm itself. The application may use bounding boxes around objects and do application-level clipping if it knows that a high proportion of the objects are being rejected. If a high ratio of objects is being accepted, the application may turn off bounding-box-based clipping. Since the application responds to the clipping status as described above, after sending a work item consisting of a group of objects, the application waits (14) for the graphics adapter to return the clipping status (21) for the work item, so the host processor is actually executing the flow diagram shown in FIG. 2.
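The way an application might react to the returned status can be sketched as follows. This is only an illustration under stated assumptions: the status strings "accept", "reject", and "clip", the function names, and the 0.5 rejection threshold are all hypothetical, while the triangle counts of 100 and 5000 are taken from the teapot example above.

```python
def choose_triangle_count(previous_status, coarse=100, fine=5000):
    """Pick a tessellation level from the status of the previous submission.

    A trivial accept suggests the object is fully visible, so more detail is
    worthwhile; a reject or clip suggests coarse tessellation is enough.
    """
    if previous_status == "accept":
        return fine
    return coarse


def use_bounding_box_clipping(statuses, reject_threshold=0.5):
    """Decide whether application-level bounding-box clipping is worthwhile.

    Returns True when the fraction of trivially rejected objects reaches the
    (assumed) threshold, and False when most objects are being accepted.
    """
    if not statuses:
        return False
    rejected = sum(1 for s in statuses if s == "reject")
    return rejected / len(statuses) >= reject_threshold
```

With these assumptions, a history of two rejects and one accept enables bounding-box clipping, while two accepts and one reject leave it disabled.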
Referring to FIG. 2, after the clipping stage (18), the graphics adapter performs lighting (19) and perspective transformation and projection (20) for all vertices in the group of objects. Then the clipping status is returned to the host (21). Finally, the graphics adapter sends the group of objects (22) to the rasterizer (37 of FIG. 1) for rasterization (23), or scan conversion, and fragment processing, after which they appear on the display (31 of FIG. 1).
Let us define t1 as the time spent by the host to fetch a set of objects from the application, t2 as the time spent by the host to determine the properties and attributes of a set of objects, t3 and t4 as the time it takes to transfer a set of objects from the host to the graphics adapter, t5 as the time spent by the graphics adapter to carry out model and view transformations, t6 as the time spent by the graphics adapter to determine the clipping status, t7 and t8 as the time spent by the graphics adapter to do the clipping and lighting calculations respectively, t9 as the time spent by the graphics adapter to carry out the perspective transformation and division as well as projection, t10 as the time it takes to return the clipping status from the graphics adapter to the host, and t11 as the time spent by the graphics adapter to send a set of objects to the rasterizer. The problem with the method outlined thus far is that the graphics adapter usually takes significantly longer (t4+t5+t6+t7+t8+t9, about 10 times longer) to process the work item than the host takes to generate it (t1+t2+t3). Thus, in the flow diagram for the host in FIG. 2, the host spends most of its time waiting for the status (tn=t4+t5+t6+t7+t8+t9+t10-t1-t2-t3) while the graphics adapter is doing its work as shown in FIG. 2. So instead of going to work on the next set of objects, the host is wasting its processing power. With rapid advances in CPU design, even small waits for the processor mean a lot of wasted capacity. The traditional solution to this problem is to use task switching on the CPU rather than wait for the graphics adapter to return status. However, this is not a good idea because the time taken for processing a work item is much shorter (on the order of milliseconds on current CPUs) than a process slice in the operating system (20-55 milliseconds). 
Thus, a solution that puts the host process to sleep as soon as it sends a work item to the graphics adapter and that wakes up the host process when the return status is available from the graphics adapter will have enormous software overheads and therefore be terribly inefficient.
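The timing argument can be made concrete with a small worked example in Python. The per-stage times below are hypothetical values in microseconds, chosen only so that the graphics adapter takes 10 times longer than the host, as stated above; they are not measurements. The function names are likewise assumptions of this sketch.

```python
def host_wait_time(t):
    """Return tn = (t4 + ... + t10) - (t1 + t2 + t3), as defined in the text."""
    adapter_side = sum(t[i] for i in range(4, 11))  # t4 through t10
    host_side = sum(t[i] for i in range(1, 4))      # t1 through t3
    return adapter_side - host_side


def adapter_to_host_ratio(t):
    """Return (t4 + ... + t9) / (t1 + t2 + t3)."""
    adapter_work = sum(t[i] for i in range(4, 10))  # t4 through t9
    host_work = sum(t[i] for i in range(1, 4))      # t1 through t3
    return adapter_work / host_work


# Hypothetical stage times in microseconds, keyed by the index of t1..t11.
times_us = {1: 20, 2: 30, 3: 50, 4: 50, 5: 200, 6: 100,
            7: 200, 8: 300, 9: 150, 10: 50, 11: 50}

print(adapter_to_host_ratio(times_us))  # 10.0
print(host_wait_time(times_us))         # 950
```

Under these assumed numbers the host generates a work item in 100 microseconds and then waits 950 microseconds for the status, while an operating system time slice of 20 milliseconds is about 20 times longer than the whole exchange, which is why switching tasks during the wait does not pay off.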
In a master-slave graphics system as just described, applications communicate with the graphics subsystem through a graphics API (Application Programming Interface). Besides providing a unified interface to the functionality of the graphics pipeline (FIG. 2), the API also encapsulates and thereby hides the implementation details of how the graphics pipeline is distributed between the host (master) and the graphics adapter (slave) and of how the host and the graphics adapter communicate. Some graphics APIs, e.g., Direct 3D (D3D) from Microsoft Corporation, provide feedback about the operations performed by the API for primitives submitted by the applications. For instance, Direct 3D informs the application whether a primitive needs to be clipped against the view frustum. This information is provided as a return value from the function called by the application to submit a primitive. Unfortunately, determining this status return value requires substantial processing within the graphics subsystem, resulting in undue delay before returning from the API call to the application. Since the application is blocked while waiting for the API call to return, an inefficient implementation of this feature will result in slow performance of an application written to such APIs.