The simulation and rendering of massive crowds of characters with a high level of detail from a variety of viewpoints has presented a difficult scene management challenge to the gaming and general graphics community. The latest generations of commodity graphics processing units (GPUs) demonstrate incredible increases in geometry performance, especially with the inclusion of a GPU tessellation pipeline.
Nevertheless, even with state-of-the-art graphics hardware, rendering thousands of complex characters (objects) with high polygon counts at interactive rates is very difficult and computationally expensive. From a given viewpoint, some of these characters may be very small (virtually invisible) or not visible at all. Rendering such invisible or virtually invisible characters, each of which may contain over a million polygons, can severely impede performance and waste critical computing resources. Object culling and level of detail (LOD) techniques are therefore required in order to eliminate or minimize the rendering of these invisible or virtually invisible objects.
Graphics rendering is computationally demanding. GPUs typically possess nearly an order of magnitude more computing resources than central processing units (CPUs). This has prompted an increasing interest in using GPUs to perform more general types of calculation. For example, in game applications, many of the calculations used to drive the objects in the game world (such as physics simulation or artificial intelligence) may be accelerated by moving them to the GPU. Doing so, however, complicates the scene management techniques which must be used for efficient rendering.
FIG. 1 is an example of a conventional graphics system 100. A typical graphics system consists of one or more host central processing units (CPUs), a GPU, and corresponding memories (the host and graphics memories may be physically separate, or they may be shared). A graphics application or application program (AP), executing on a host CPU, issues commands to the GPU by means of a graphics application programming interface (API) such as OpenGL or Microsoft's DirectX, which provides an abstract set of commands. The API implementation forwards the commands to a device-specific driver, which is responsible for translating them into a form that can be executed by the GPU.
The programming model of a graphics system is as follows. The CPU is responsible for issuing rendering commands to the GPU, such as configuring the various pipeline stages, or issuing primitives to the graphics pipeline. A primitive is a geometric entity consisting of a set of vertices. The set of supported primitives includes, but is not limited to: points (a single vertex), lines (a pair of vertices), and triangles (three vertices). For each vertex, primitive, or pixel generated by a given rendering command, corresponding application defined programs are invoked by the hardware to perform calculations needed for rendering.
A vertex shader (VS) is a GPU program which is invoked for individual primitive vertices. Each VS invocation obtains a set of attributes for a single input vertex, performs user programmable calculations, and generates an output vertex. The input vertex data is generally retrieved from a vertex buffer (input buffer), which is typically located in graphics memory.
A geometry shader (GS) is a GPU program which is invoked for individual geometric primitives. Each GS invocation receives the VS outputs for a single primitive, performs user programmable calculations, and emits a variable number of output primitives (or none at all). The GS may be configured for stream output, which causes all primitives emitted from the geometry shader to be written consecutively (and in order) to a vertex buffer. The GPU maintains a counter which tracks the number of primitives emitted to a particular buffer.
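The stream output behavior described above may be illustrated with a minimal CPU-side sketch. This is a conceptual model only and not GPU code: the names `StreamOutputBuffer`, `run_geometry_shader`, and `keep_if_in_front` are hypothetical and are introduced solely for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Vertex = Tuple[float, float, float]
Primitive = Tuple[Vertex, ...]


class StreamOutputBuffer:
    """Hypothetical stand-in for a vertex buffer bound for stream output."""

    def __init__(self) -> None:
        self.primitives: List[Primitive] = []
        self.primitive_count = 0  # models the counter maintained by the GPU

    def append(self, prim: Primitive) -> None:
        # Primitives are written consecutively, in emission order.
        self.primitives.append(prim)
        self.primitive_count += 1


def run_geometry_shader(
    inputs: Iterable[Primitive],
    shader: Callable[[Primitive], List[Primitive]],
    out: StreamOutputBuffer,
) -> None:
    """Invoke the shader once per input primitive and stream out its results."""
    for prim in inputs:
        for emitted in shader(prim):  # may be zero, one, or many primitives
            out.append(emitted)


def keep_if_in_front(prim: Primitive) -> List[Primitive]:
    """Example shader (an assumption): pass a point through only when z > 0."""
    (vertex,) = prim
    return [prim] if vertex[2] > 0.0 else []


points = [((0.0, 0.0, 1.0),), ((1.0, 0.0, -2.0),), ((2.0, 0.0, 3.0),)]
buffer = StreamOutputBuffer()
run_geometry_shader(points, keep_if_in_front, buffer)
```

In this sketch, the second point is discarded because the example shader emits nothing for it, so the buffer holds two primitives and the counter reflects that number.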
An API mechanism exists to cause the GPU to re-issue a set of primitives that were previously emitted by a geometry shader using the stored primitive count. This is presented to the application as a special graphics command which is issued by the CPU. For example, in Microsoft's DX10, this is known as a DrawAuto command. In addition, the number of primitives which would be issued by a DrawAuto call may be queried by the application to determine the number of object instances to ultimately render (draw).
Graphics APIs also include support for geometry instancing, whereby multiple copies of a single batch of geometry are issued to the graphics pipeline. With instance rendering, a separate vertex buffer (input buffer) may be used to supply per-instance data to the VS. The VS will be invoked multiple times with the same vertex data, but each invocation of a particular vertex is supplied with different instance data. Geometry instancing is the preferred way to render numerous copies of identical objects in different locations or configurations, because a large number of objects may be submitted for rendering with minimal CPU overhead.
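Geometry instancing may be pictured with the following minimal CPU-side sketch, in which the same vertex data is processed once per instance with different per-instance data. The functions `vertex_shader` and `draw_instanced` are hypothetical, and the use of a simple translation offset as the per-instance data is an assumption made for brevity.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def vertex_shader(vertex: Vec3, instance_offset: Vec3) -> Vec3:
    """Hypothetical VS: translate the input vertex by the per-instance offset."""
    return tuple(v + o for v, o in zip(vertex, instance_offset))


def draw_instanced(vertex_buffer: List[Vec3], instance_buffer: List[Vec3]) -> List[Vec3]:
    """Model a single instanced draw: every vertex is shaded once per instance."""
    outputs = []
    for offset in instance_buffer:      # one entry of per-instance data
        for vertex in vertex_buffer:    # the same vertex data each time
            outputs.append(vertex_shader(vertex, offset))
    return outputs


triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
offsets = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
transformed = draw_instanced(triangle, offsets)
```

Here a single call stands in for one draw command issued by the CPU, which is why the per-object CPU overhead remains small regardless of the number of instances.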
The various shader stages (vertex, geometry, pixel) may be implemented as different threads sharing the same physical hardware (as is the case in current GPUs such as Advanced Micro Devices' ATI Radeon HD4870), or as separate physical hardware (as was the case in earlier generation GPUs). Typically, the programmable stages are vectorized, and operate on a number of elements (primitives and vertices) in parallel (but this does not have to be the case).
Culling and LOD are integral parts of a modern rendering engine. Culling is the process of identifying objects in the scene which are not visible and excluding them so that they will not be rendered. This is generally accomplished by performing some form of visibility test on a bounding volume which encloses the object. LOD refers to the use of simplified geometry or shading for visible objects with less visual importance. These techniques, collectively, are sometimes referred to as “scene management.” Given a set of objects to be rendered, it is the job of a rendering engine to identify the visible objects, to assign a level of detail to each visible object, and to issue the necessary rendering commands to the GPU.
The most common type of culling, known as view frustum culling, uses a geometric test to exclude objects which lie outside the field of view of the camera. In current systems, this culling test is performed on the host CPU, prior to submitting the object to the GPU for rendering. Objects which fail the test are simply not submitted.
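A conventional CPU-side frustum test of this kind may be sketched as follows, using a bounding sphere as the bounding volume. The plane representation (an inward-pointing normal and an offset), the function `sphere_in_frustum`, and the box-shaped example volume are all assumptions made for illustration; a real perspective frustum would have slanted side planes.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Plane = Tuple[float, float, float, float]  # (nx, ny, nz, d), normal points inward


def sphere_in_frustum(center: Vec3, radius: float, planes: List[Plane]) -> bool:
    """Return False only when the sphere lies entirely outside some plane."""
    cx, cy, cz = center
    for nx, ny, nz, d in planes:
        signed_distance = nx * cx + ny * cy + nz * cz + d
        if signed_distance < -radius:
            return False  # completely on the outer side of this plane
    return True  # inside or intersecting: conservatively treated as visible


# Assumed example volume: x and y in [-5, 5], z in [1, 100].
FRUSTUM = [
    (1.0, 0.0, 0.0, 5.0), (-1.0, 0.0, 0.0, 5.0),
    (0.0, 1.0, 0.0, 5.0), (0.0, -1.0, 0.0, 5.0),
    (0.0, 0.0, 1.0, -1.0), (0.0, 0.0, -1.0, 100.0),
]

objects = [
    ("inside", (0.0, 0.0, 50.0), 1.0),
    ("outside", (7.0, 0.0, 50.0), 1.0),
    ("straddling", (5.5, 0.0, 50.0), 1.0),
]
submitted = [name for name, c, r in objects if sphere_in_frustum(c, r, FRUSTUM)]
```

Only the objects that pass the test are submitted; the object lying wholly outside one plane is dropped before any rendering command is issued for it.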
Another kind of culling is occlusion culling. Occlusion culling eliminates objects that are not visible on a display screen because they are blocked by other objects, such as when a character moves behind a building in a game. One common technique is to separate the scene into fixed regions, and to pre-compute, for each region, the set of regions potentially visible from it (called a potentially visible set or PVS). Objects which do not lie in the PVS of the region containing the camera are simply not rendered. This method requires expensive preprocessing of the scene in order to be effective, and therefore is not suitable for highly dynamic environments.
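The run-time portion of a PVS scheme may be sketched as a simple table lookup. The uniform grid of regions, the `CELL_SIZE` value, the hand-written `PVS` table, and the function names below are hypothetical; in practice the table would be produced by the expensive preprocessing step described above.

```python
import math
from typing import Dict, List, Set, Tuple

Vec3 = Tuple[float, float, float]
Region = Tuple[int, int]

CELL_SIZE = 10.0  # assumed size of each fixed region on the ground plane


def region_of(position: Vec3) -> Region:
    """Map a world position to the grid region that contains it."""
    x, _, z = position
    return (math.floor(x / CELL_SIZE), math.floor(z / CELL_SIZE))


# Assumed precomputed table: region -> regions potentially visible from it.
PVS: Dict[Region, Set[Region]] = {
    (0, 0): {(0, 0), (1, 0)},
    (1, 0): {(0, 0), (1, 0), (2, 0)},
    (2, 0): {(1, 0), (2, 0)},
}


def cull_with_pvs(camera: Vec3, objects: List[Tuple[str, Vec3]],
                  pvs: Dict[Region, Set[Region]]) -> List[str]:
    """Keep only objects whose region is in the PVS of the camera's region."""
    visible_regions = pvs.get(region_of(camera), set())
    return [name for name, pos in objects if region_of(pos) in visible_regions]


objects = [
    ("crate", (5.0, 0.0, 5.0)),
    ("tower", (15.0, 0.0, 2.0)),
    ("statue", (25.0, 0.0, 8.0)),
]
drawn = cull_with_pvs((3.0, 0.0, 3.0), objects, PVS)
```

Because the table is fixed in advance, an occluder that moves or is destroyed at run time is not reflected in it, which illustrates why the approach is poorly suited to highly dynamic environments.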
Modern APIs such as Direct3D 10 also provide conditional rendering functionality which may be used for occlusion culling. To use this technique, the bounding volume of an object is rasterized and compared to a Z-buffer containing the depths of the occluders. The API commands to render the actual object can be issued conditionally, so that they are only carried out if at least one bounding volume pixel is not occluded. This technique can provide effective culling in arbitrary scenes, but it requires the CPU to issue several rendering commands per object, which can quickly create a performance bottleneck.
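The depth comparison underlying this technique may be modeled on the CPU as shown below. This sketch does not use any graphics API: the function `any_pixel_visible` is hypothetical, the bounding volume is assumed to have already been projected to a screen-space rectangle with a single conservative (nearest) depth, and smaller depth values are assumed to be closer to the camera.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive


def any_pixel_visible(z_buffer: List[List[float]], rect: Rect,
                      nearest_depth: float) -> bool:
    """Return True if at least one bounding-volume pixel passes the depth test."""
    x0, y0, x1, y1 = rect
    for y in range(y0, y1):
        for x in range(x0, x1):
            if nearest_depth < z_buffer[y][x]:
                return True  # this pixel is not hidden by an occluder
    return False


# Assumed 4x4 depth buffer: a wall at depth 0.3 covers the two left columns,
# and the two right columns are empty (far plane at depth 1.0).
z_buffer = [[0.3, 0.3, 1.0, 1.0] for _ in range(4)]

hidden = any_pixel_visible(z_buffer, (0, 0, 2, 2), 0.6)    # fully behind wall
partial = any_pixel_visible(z_buffer, (1, 1, 3, 3), 0.6)   # pokes past wall
in_front = any_pixel_visible(z_buffer, (0, 0, 2, 2), 0.2)  # nearer than wall
```

In the conditional rendering scheme described above, the draw commands for an object would be carried out only when this kind of test reports at least one visible pixel, and the test would be issued separately for each object.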
LOD selection is also typically implemented using the CPU. In a typical LOD system, objects which are determined to be visible may be partitioned into multiple groups, based on their visual importance. One way to do this is to partition the objects based on their distance from the camera. The less important groups may then be rendered in a less costly way (for example, by using simpler geometric models).
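Distance-based partitioning of this kind may be sketched as follows. The threshold values in `LOD_DISTANCES` and the function names `select_lod` and `partition_by_lod` are assumptions chosen for illustration, with level 0 denoting the most detailed representation.

```python
import math
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]

LOD_DISTANCES = (20.0, 60.0)  # assumed upper distance limits for LOD 0 and LOD 1


def select_lod(camera: Vec3, position: Vec3,
               distances: Tuple[float, ...] = LOD_DISTANCES) -> int:
    """Return the index of the first distance band that contains the object."""
    d = math.dist(camera, position)
    for level, limit in enumerate(distances):
        if d < limit:
            return level
    return len(distances)  # beyond every limit: coarsest level


def partition_by_lod(camera: Vec3,
                     objects: List[Tuple[str, Vec3]]) -> Dict[int, List[str]]:
    """Group visible objects by level of detail so each group can be drawn together."""
    groups: Dict[int, List[str]] = {}
    for name, position in objects:
        groups.setdefault(select_lod(camera, position), []).append(name)
    return groups


camera = (0.0, 0.0, 0.0)
visible = [
    ("hero", (0.0, 0.0, 5.0)),
    ("guard", (0.0, 0.0, 30.0)),
    ("villager", (40.0, 0.0, 30.0)),
    ("bird", (0.0, 0.0, 100.0)),
]
groups = partition_by_lod(camera, visible)
```

Each resulting group may then be rendered with a geometric model of corresponding complexity, with the farthest group receiving the simplest model.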
Existing implementations of object culling and LOD selection typically perform many of their calculations using the CPU, which imposes a serious performance bottleneck. Occlusion culling using conditional rendering can be particularly harmful to system performance, since each object must be tested and drawn individually, which makes it impossible to leverage geometry instancing. Furthermore, CPU-based implementations require that the positions of the objects be made available to the CPU. This complicates GPU-based simulation of objects, because it may require expensive data transfers from graphics memory to host memory.
What is needed is a scalable method for implementing frustum culling, occlusion culling, LOD selection, or other types of object culling that is compatible with GPU-based simulation of the objects. Such a method should not require any scene preprocessing or occluder selection (anything that is rendered earlier should function as an occluder), nor should it require additional per-object CPU overhead for rendering. In addition, such a method should not require excessive CPU/GPU communication.