1. Field of the Invention
The present invention relates generally to the field of graphics processing and more specifically to a system and method for geometry shading.
2. Description of the Related Art
A typical computing system includes a central processing unit (CPU), a system memory, a graphics processing unit (GPU), a GPU local memory, a GPU on-chip memory, one or more display devices, and one or more input devices. The CPU usually executes the overall structure of the software application and configures the GPU to perform specific tasks in the graphics pipeline (the collection of processing steps performed to transform 3-D images into 2-D images). Some GPUs are capable of very high performance using a relatively large number of small, parallel execution threads on dedicated programmable hardware processing units.
To fully realize the processing capabilities of advanced GPUs, GPU functionality may be exposed to graphics application developers through an application programming interface (API) of calls and libraries. Among other things, doing so enables graphics application developers to tailor their shading programs to optimize the way GPUs process graphics scenes and images. In some GPUs, the API may allow the hardware processing units to be configured as geometry shading engines that include multiple geometry shaders operating in parallel.
Typically, each geometry shader within a geometry shading engine is programmed to perform image-rendering operations on a single input primitive (group of vertices), such as a point, line, or triangle, to produce zero or more output primitives. Upon receiving an input primitive, each geometry shader launches one execution thread. Since multiple geometry shaders may be executing in parallel, the inputs and the outputs of the geometry shaders are stored in memory buffers to preserve the processing order of the primitives throughout the graphics pipeline. Typically, at least part of the geometry shader data buffers is stored in GPU on-chip memory. After all of the geometry shader threads across the different geometry shaders have completed, the output buffers of the geometry shaders are drained serially to transmit the geometry data to the next stage in the graphics pipeline.
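The buffering scheme described above can be modeled in a few lines of Python. This is a simplified software sketch, not GPU code: the names `run_batch`, `shade`, and `duplicate_triangles_only` are hypothetical, a thread pool stands in for the parallel geometry shaders, and primitives are assumed to be tuples of vertices.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(input_primitives, shade):
    """Model one batch: shade primitives in parallel, then drain in input order.

    `shade` is a hypothetical geometry shader program. It takes one input
    primitive (a tuple of vertices) and returns a list of zero or more
    output primitives.
    """
    # One execution thread per input primitive, each with its own output buffer.
    with ThreadPoolExecutor(max_workers=max(1, len(input_primitives))) as pool:
        # map() yields results in input order, whichever thread finishes first.
        output_buffers = list(pool.map(shade, input_primitives))

    # All threads have completed; drain the buffers serially, in input order.
    drained = []
    for buffer in output_buffers:
        drained.extend(buffer)
    return drained


def duplicate_triangles_only(primitive):
    """Example shader (assumed): drop points and lines, emit each triangle twice."""
    return [primitive, primitive] if len(primitive) == 3 else []


if __name__ == "__main__":
    tri_a = ((0, 0), (1, 0), (0, 1))
    line = ((0, 0), (1, 1))
    tri_b = ((2, 2), (3, 2), (2, 3))
    print(run_batch([tri_a, line, tri_b], duplicate_triangles_only))
```

In this model, the per-primitive output buffers are what allow the drain step to preserve processing order even though the threads may finish in any order.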
For example, for use in cube-map rendering, a streaming multiprocessor that includes 32 streaming processors may be configured as a geometry shading engine that includes 32 parallel geometry shaders. If each of these geometry shaders receives an input triangle, then each of these geometry shaders executes one thread that processes the 6 cube faces and emits up to 6 output triangles, one for each face of the cube. In this case, the memory allocated to buffer the inputs of the geometry shaders must be able to store 96 vertices:
$$32\ \text{geometry shaders} \times 1\ \frac{\text{input triangle}}{\text{geometry shader}} \times 3\ \frac{\text{vertices}}{\text{triangle}}$$
Similarly, the memory allocated to buffer the outputs of the geometry shaders must be able to store 576 vertices:
$$32\ \text{geometry shaders} \times 6\ \frac{\text{output triangles}}{\text{geometry shader}} \times 3\ \frac{\text{vertices}}{\text{triangle}}$$
Furthermore, if it takes M cycles to process each cube face and store the resulting triangle in the output buffer, then the number of cycles for each geometry shader to fully execute its input triangle is:
$$6\ \text{faces} \times M\ \frac{\text{cycles}}{\text{face}}$$
In addition, before the geometry shaders may process the next 32 input primitives, the output buffers must be drained to the next stage in the pipeline. Therefore, the latency of the geometry shaders in this example is:
$$\left(6\ \text{faces} \times M\ \frac{\text{cycles}}{\text{face}}\right) + \left(\text{time to drain output buffers}\right)$$
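The arithmetic in the cube-map example can be checked with a short script. The following Python sketch is illustrative only. The function names are hypothetical, the drain time is assumed to be expressed in cycles, and the sample values for M and for the drain time are invented for the demonstration.

```python
VERTICES_PER_TRIANGLE = 3


def input_buffer_vertices(num_shaders, input_triangles_per_shader=1):
    """Vertices the input buffer must hold for one batch of input triangles."""
    return num_shaders * input_triangles_per_shader * VERTICES_PER_TRIANGLE


def output_buffer_vertices(num_shaders, output_triangles_per_shader):
    """Vertices the output buffer must hold if every shader emits its maximum."""
    return num_shaders * output_triangles_per_shader * VERTICES_PER_TRIANGLE


def shader_latency_cycles(num_faces, cycles_per_face, drain_cycles):
    """Cycles before the next batch can start: per-face processing plus drain."""
    return num_faces * cycles_per_face + drain_cycles


if __name__ == "__main__":
    print(input_buffer_vertices(32))      # 32 * 1 * 3 = 96
    print(output_buffer_vertices(32, 6))  # 32 * 6 * 3 = 576
    # M = 10 cycles per face and a 100-cycle drain are assumed values.
    print(shader_latency_cycles(6, 10, 100))  # 6 * 10 + 100 = 160
```

The output buffer is six times the size of the input buffer in this example because it must be sized for the maximum number of triangles each geometry shader may emit, whether or not a given input triangle touches all six cube faces.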
The functionality of programmable geometry shaders allows the GPU to implement shading programs that might otherwise be mapped to the CPU. One drawback to using geometry shaders, however, is that the memory required to store both the input primitives and the output primitives of the geometry shaders is expensive because the on-chip memories take up valuable die area. Another drawback is that the latency of the geometry shaders may be large. One way to reduce the cost of the memory is to use less expensive off-chip memory, such as the GPU local memory. However, since accessing off-chip memory is slower than accessing on-chip memory, such a solution will further increase the latency of the geometry shaders and, thus, may cause the geometry shaders to become a bottleneck in the graphics pipeline. Another approach to storing the primitives associated with the geometry shaders is to use a combination of on-chip memory and off-chip memory. Again, the on-chip memory will be expensive, and accessing the off-chip memory may cause the geometry shaders to become a bottleneck in the graphics pipeline.
As the foregoing illustrates, what is needed in the art is a more effective technique for parallel geometry shader processing.