1. Field of the Invention
Embodiments of the present invention relate generally to computer hardware and more specifically to distributed vertex attribute fetch.
2. Description of the Related Art
The processing power of a modern central processing unit (CPU) may be supplemented using a co-processor, such as a graphics processing unit (GPU). Oftentimes, the GPU is used as a co-processor configured to process graphics data to generate pixels that are displayed on a screen.
Graphics data typically includes a plurality of vertices, where each vertex is a single data point in a three-dimensional (3D) model. Each vertex is defined by a set of attributes that describe the position and color of the vertex, among other things. Graphics data also typically includes a plurality of “primitive descriptions.” A primitive description identifies a set of vertices that represents a particular geometric shape. For example, a primitive description may identify three different vertices, each of which represents a different corner of one triangle. Other geometric shapes may also be represented by primitive descriptions. The geometric shape described by a primitive description is known as a “graphics primitive.” The GPU identifies the pixels associated with each graphics primitive and colorizes these pixels according to the vertices and the vertex attributes associated with the graphics primitive.
Vertices and graphics primitives are stored in memory within a vertex array and an index array. Each entry of the vertex array includes vertex attributes associated with the vertex. Each entry of the index array includes an index that points to an entry in the vertex array. The index array is organized so that sets of indices represent primitive descriptions. The GPU accesses the vertices and the primitive descriptions that are stored in the vertex array and the index array, respectively, and processes this data using a sequence of different processing components.
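The relationship between the two arrays may be illustrated with a minimal sketch. The attribute names, the sample values, and the assumption that every run of three indices forms one triangle are hypothetical and are chosen only for illustration; they are not taken from any particular implementation.

```python
# Hypothetical vertex array: one entry per vertex, each holding its attributes.
vertex_array = [
    {"position": (0.0, 0.0, 0.0), "color": (1.0, 0.0, 0.0)},
    {"position": (1.0, 0.0, 0.0), "color": (0.0, 1.0, 0.0)},
    {"position": (0.0, 1.0, 0.0), "color": (0.0, 0.0, 1.0)},
    {"position": (1.0, 1.0, 0.0), "color": (1.0, 1.0, 1.0)},
]

# Each index points to an entry in vertex_array. Assumption: every run of
# three consecutive indices is one triangle primitive description.
index_array = [0, 1, 2, 2, 1, 3]

INDICES_PER_PRIMITIVE = 3


def primitive_descriptions(indices, size=INDICES_PER_PRIMITIVE):
    """Group a flat index list into fixed-size primitive descriptions."""
    return [
        tuple(indices[i:i + size])
        for i in range(0, len(indices) - size + 1, size)
    ]


def resolve(primitive, vertices):
    """Look up the vertex array entry that each index of a primitive points to."""
    return [vertices[i] for i in primitive]


triangles = primitive_descriptions(index_array)
```

In this sketch the two triangles share vertices 1 and 2, so the index array references four vertex array entries with six indices rather than storing six full vertices.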
The components of the GPU that process the graphics data are collectively known as a “graphics processing pipeline.” A portion of a prior art graphics processing pipeline 10 is illustrated in FIG. 1. As shown, the graphics processing pipeline 10 includes a host 12, a front end 14, a data assembler 16, and a memory 18. Indices are stored in an index array 20 and vertices are stored in a vertex array 22 that are each located within memory 18. Each entry of vertex array 22 includes one or more attributes of a particular size. For example, each entry of vertex array 22 may include 8 attributes, where each attribute has a size of 4 bytes, and each vertex occupies 32 bytes of memory 18.
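The size arithmetic in the example above can be checked with a short sketch. The choice of 4-byte floating-point attributes, the little-endian packing, and the helper names are assumptions made for illustration only; the example above specifies only the attribute count and the attribute size.

```python
import struct

ATTRIBUTES_PER_VERTEX = 8

# Assumption: each attribute is a 4-byte float, packed without padding.
VERTEX_FORMAT = "<" + "f" * ATTRIBUTES_PER_VERTEX
VERTEX_SIZE = struct.calcsize(VERTEX_FORMAT)  # bytes occupied by one vertex


def pack_vertex(attributes):
    """Serialize one vertex (8 attribute values) into its in-memory form."""
    return struct.pack(VERTEX_FORMAT, *attributes)


def vertex_offset(index):
    """Byte offset of a vertex array entry, assuming tightly packed entries."""
    return index * VERTEX_SIZE


def read_vertex(memory, index):
    """Fetch the attributes of the vertex that a given index points to."""
    return struct.unpack_from(VERTEX_FORMAT, memory, vertex_offset(index))


# A hypothetical vertex array of four vertices laid out back to back.
memory = b"".join(
    pack_vertex([float(v)] * ATTRIBUTES_PER_VERTEX) for v in range(4)
)
```

Under these assumptions one vertex occupies 32 bytes, that is, 256 bits.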
Data assembler 16 reads indices from the index array 20. Data assembler 16 then loads some or all of these indices into a sliding window buffer 24 located within data assembler 16. Data assembler 16 loads a particular index into sliding window buffer 24 only when the index is not already stored in sliding window buffer 24. Accordingly, each index stored in sliding window buffer 24 is unique. When data assembler 16 reads an index from index array 20 that is not stored in sliding window buffer 24, the oldest entry in sliding window buffer 24 is removed to make room for the new entry. When a new entry is added to sliding window buffer 24, data assembler 16 retrieves the vertex associated with the new entry and transmits the vertex attributes associated with the vertex to a selector 26.
Data assembler 16 also identifies primitive descriptions within the indices read from index array 20. Data assembler 16 generates a translated version of each primitive description, where the translated version comprises pointers to locations in sliding window buffer 24. Data assembler 16 transmits each translated primitive description to selector 26.
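The behavior of the sliding window buffer described above may be sketched as follows. The class name, the fixed capacity, and the assumption that the oldest entry is evicted only once the buffer is full are hypothetical and serve only to illustrate the first-in, first-out policy.

```python
from collections import deque


class SlidingWindowBuffer:
    """Holds unique indices and evicts the oldest entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # oldest entry on the left

    def load(self, index):
        """Return True if the index was newly added, i.e. a vertex fetch is needed."""
        if index in self.entries:
            return False  # already resident, so no fetch is required
        if len(self.entries) == self.capacity:
            self.entries.popleft()  # remove the oldest entry to make room
        self.entries.append(index)
        return True


def indices_to_fetch(index_array, capacity):
    """List, in order, the indices whose vertices would be fetched from memory."""
    window = SlidingWindowBuffer(capacity)
    return [index for index in index_array if window.load(index)]
```

With a capacity of 4, `indices_to_fetch([0, 1, 2, 2, 1, 3], 4)` returns `[0, 1, 2, 3]`, so each vertex is fetched once. With a capacity of 2, `indices_to_fetch([0, 1, 2, 0], 2)` returns `[0, 1, 2, 0]`, because index 0 has been evicted by the time it is read again.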
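The translation step may be sketched as a lookup from index values to buffer locations. The function name, the list representation of the buffer, and the assumption that every index of the primitive description is resident in the buffer at translation time are hypothetical.

```python
def translate_primitive(primitive, window_slots):
    """Replace each index of a primitive description with its buffer slot.

    window_slots is a hypothetical snapshot of the sliding window buffer:
    position k holds the index currently stored in slot k, or None if empty.
    Raises KeyError if an index of the primitive is not resident.
    """
    slot_of = {
        index: slot
        for slot, index in enumerate(window_slots)
        if index is not None
    }
    return tuple(slot_of[index] for index in primitive)
```

For example, if the buffer slots hold indices `[7, 12, 40, 3]`, the primitive description `(12, 40, 7)` translates to the slot pointers `(1, 2, 0)`. In this sketch each pointer needs only enough bits to address a buffer slot, rather than enough bits to address an entry of the vertex array.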
Selector 26 receives vertex attributes and translated primitive descriptions from data assembler 16 and selects one of a plurality of shader multiprocessors (SMs) 28A-28N to process this data. Selector 26 typically employs a round-robin protocol to distribute data to each SM 28A-28N. Selector 26 transmits a preset amount of graphics data to a particular SM, then selects the next SM in a sequence to receive graphics data. Each SM 28A-28N includes a vertex shader 30 that colorizes pixels associated with vertex attributes and primitive descriptions received from selector 26. When a particular SM completes graphics data processing, the processed graphics data is output to an xbuffer 32 that collects data from SMs 28A-28N.
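A round-robin distribution of the kind described above may be sketched as follows. The function name, the batch size parameter standing in for the preset amount of graphics data, and the list-based work items are hypothetical.

```python
from itertools import cycle


def distribute_round_robin(work_items, num_sms, batch_size):
    """Hand fixed-size batches of work to each SM in turn.

    Returns one list per SM containing the work items assigned to it.
    """
    assignments = [[] for _ in range(num_sms)]
    sm_ids = cycle(range(num_sms))  # 0, 1, ..., num_sms - 1, 0, 1, ...
    for start in range(0, len(work_items), batch_size):
        sm = next(sm_ids)  # select the next SM in the sequence
        assignments[sm].extend(work_items[start:start + batch_size])
    return assignments
```

For example, distributing seven work items across three SMs in batches of two assigns items 0, 1, and 6 to the first SM, items 2 and 3 to the second, and items 4 and 5 to the third.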
One disadvantage of this configuration is that each unique vertex and the associated vertex attributes are sequentially fetched from memory 18. This may cause a memory access bottleneck when numerous vertices need to be fetched. Another disadvantage is that data assembler 16 must be connected to selector 26 with a 256-bit wide bus so that an entire vertex may be transmitted to selector 26. Transmitting more than one vertex simultaneously to selector 26 would require that the bus width be increased. A further disadvantage is that data assembler 16 must process and transmit a 256-bit block of data each time a unique index is encountered. When numerous unique indices are encountered, the processing requirements of data assembler 16 increase significantly.
Accordingly, there remains a need in the art for a more efficient way to process graphics data in a graphics processing pipeline.