Field of the Invention
Embodiments of the present invention relate generally to computer graphics processing and, more specifically, to distributed index fetch, primitive assembly, and primitive batching.
Description of the Related Art
In computer graphics, software applications render two-dimensional (2D) and three-dimensional (3D) computer graphics objects for display by transmitting a draw command, also referred to as a draw call, via an application programming interface (API). Typically, when rendering a graphics object, a graphics driver receives the draw command from a particular software application via the API and transfers the draw command to a graphics processing unit (GPU) for rendering. The draw command includes an address that points to a location within an index buffer. The index buffer includes a list of pointers to vertices for the graphics primitives that make up the graphics object. The graphics primitives are typically points, line segments, triangles, quadrilaterals, or surface patches. Upon receiving the draw command from the graphics driver, the GPU draws the graphics object by rendering the graphics primitives associated with the vertices. Typically, the rendered graphics primitives are displayed on a display device.
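By way of illustration only, the relationship between an index buffer and the vertices it references can be sketched as follows. The vertex data, the function name assemble_triangles, and the parameters first and count are hypothetical and are not drawn from any particular API; the sketch assumes a simple triangle list in which every three consecutive indices form one triangle.

```python
# Hypothetical vertex data: the four corners of a unit square as (x, y) pairs.
vertices = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Index buffer: each entry points at one vertex. Assumed layout is a triangle
# list, so every three consecutive entries describe one triangle.
index_buffer = [0, 1, 2, 0, 2, 3]


def assemble_triangles(vertices, index_buffer, first=0, count=None):
    """Resolve a run of indices into triangles (tuples of three vertices).

    `first` plays the role of the location in the index buffer that the draw
    command points to; `count` is the number of indices to consume.
    """
    if count is None:
        count = len(index_buffer) - first
    indices = index_buffer[first:first + count]
    usable = len(indices) - len(indices) % 3  # ignore a trailing partial triangle
    return [
        tuple(vertices[i] for i in indices[k:k + 3])
        for k in range(0, usable, 3)
    ]


triangles = assemble_triangles(vertices, index_buffer)
```

In this sketch the two triangles share vertices 0 and 2, so the index buffer holds six entries while only four vertices are stored.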
Complex graphics objects can include thousands or millions of indices. In order to more efficiently render such complex objects, the GPU usually divides the indices into units of work, referred to as batches, and distributes the batches to individual graphics processing pipelines implemented within the GPU. Prior to distributing the batches, a primitive distributor within the GPU performs an index scan that analyzes each index in order to eliminate duplicate indices. Duplicate indices typically occur when graphics primitives are adjacent to one another. For example, a line segment could share a vertex with an adjacent line segment, while a triangle or quadrilateral could share two vertices with an adjacent triangle or quadrilateral, respectively. After eliminating duplicate indices, the primitive distributor divides the remaining indices into more or less equal-sized batches and distributes the batches to the individual graphics processing pipelines for further processing.
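A minimal software sketch of the duplicate elimination and batching described above is given below. It is an illustration under stated assumptions, not a description of any actual primitive distributor: duplicates are removed over the whole index list with a set, whereas hardware would typically compare against a limited window of preceding indices, and the function names and the sample index list are hypothetical.

```python
def unique_vertex_indices(index_buffer):
    """Return the indices in first-seen order with duplicates removed."""
    seen = set()
    unique = []
    for idx in index_buffer:
        if idx not in seen:
            seen.add(idx)
            unique.append(idx)
    return unique


def split_into_batches(items, num_pipelines):
    """Split items into num_pipelines contiguous, nearly equal-sized batches."""
    base, extra = divmod(len(items), num_pipelines)
    batches, start = [], 0
    for p in range(num_pipelines):
        size = base + (1 if p < extra else 0)  # first `extra` batches get one more
        batches.append(items[start:start + size])
        start += size
    return batches


# Hypothetical triangle list in which each triangle shares two vertices
# with the triangle before it.
index_buffer = [0, 1, 2, 1, 2, 3, 2, 3, 4]
unique = unique_vertex_indices(index_buffer)
batches = split_into_batches(unique, 2)
```

Here nine indices reduce to five unique indices, which are then divided into one batch of three and one batch of two.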
One drawback to the above approach is that the analysis needed to eliminate duplicate indices is computationally intensive. For example, if the primitive distributor were to analyze M indices simultaneously, and the primitive distributor were to compare each index with the preceding N indices to search for duplicates, then the index analysis process would require on the order of M×N comparisons. As a result, the graphics processing pipelines implemented within a GPU can typically process batches at a faster rate than the primitive distributor can analyze indices and create new batches, creating a performance bottleneck.
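The M×N cost can be made concrete with a small counting sketch. The function windowed_scan and its parameters are hypothetical; the sketch assumes that every index is compared against all of the preceding indices in its window without early exit, loosely modeling a bank of comparators that all operate at once.

```python
def windowed_scan(indices, window):
    """Flag each index that matches any of the preceding `window` indices.

    Returns (flags, comparisons), where comparisons counts every pairwise
    comparison performed. No early exit is taken on a match, so the count
    reflects the full comparison work for each index.
    """
    flags = []
    comparisons = 0
    for pos, idx in enumerate(indices):
        start = max(0, pos - window)
        is_duplicate = False
        for prev in indices[start:pos]:
            comparisons += 1
            if prev == idx:
                is_duplicate = True
        flags.append(is_duplicate)
    return flags, comparisons


# Hypothetical input: M = 9 indices scanned with a window of N = 4.
indices = [0, 1, 2, 1, 2, 3, 2, 3, 4]
flags, comparisons = windowed_scan(indices, window=4)
upper_bound = len(indices) * 4  # M x N
```

The count falls slightly below M×N only because the first few indices have fewer than N predecessors; in steady state each index incurs exactly N comparisons.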
Another drawback to the above approach is that the primitive distributor generally accesses M indices during every clock cycle, resulting in significant bandwidth impact on the memory system. For example, if the primitive distributor were to generate 4 primitives every clock cycle and each primitive needed 3 indices, then the primitive distributor would access 4×3=12 indices per clock cycle. In order to increase the throughput of the primitive distributor, either or both of the number of indices analyzed simultaneously and the number of previous indices compared to each index would need to be increased. But increasing either quantity increases the design complexity of the primitive distributor and further increases the demand on memory bandwidth, leading to further reductions in performance.
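The arithmetic in the example above can be expressed as a short calculation. The function names are hypothetical, and the 4-byte index width used to convert the index count to bytes is an assumption made purely for illustration; the text does not specify an index size.

```python
def indices_per_cycle(primitives_per_cycle, indices_per_primitive):
    """Number of indices the distributor must fetch in one clock cycle."""
    return primitives_per_cycle * indices_per_primitive


def index_bytes_per_cycle(primitives_per_cycle, indices_per_primitive,
                          bytes_per_index):
    """Index traffic per clock cycle in bytes, for an assumed index width."""
    return indices_per_cycle(primitives_per_cycle,
                             indices_per_primitive) * bytes_per_index


# The example from the text: 4 triangles per clock cycle, 3 indices each.
demand = indices_per_cycle(4, 3)

# Assumption: 32-bit (4-byte) indices.
byte_demand = index_bytes_per_cycle(4, 3, bytes_per_index=4)
```

Doubling the number of primitives generated per clock cycle doubles both figures, which is why raising throughput in this design raises memory bandwidth demand in step.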
As the foregoing illustrates, what is needed in the art is a more effective approach for distributing work in a GPU.