The present disclosure relates to computer devices and graphics processing.
Currently, in graphics processing, before a geometric primitive is rendered for a computer game, a list is built indicating which mesh instances and primitives are visible. Computer games increasingly pre-process mesh instances and primitives to determine their visibility prior to executing the normal graphics processing unit (GPU) rendering pipeline. To do this, computer games must write out a per-frame list of mesh instance data and primitives to process.
The visibility testing and compaction of per-instance data is typically done on the central processing unit (CPU). Generally, a compute shader on the GPU may determine whether a primitive, generally a triangle, is visible. Each GPU thread processes a single triangle, testing whether the triangle is front facing to the viewer, not off screen, not zero area, and/or not occluded. If the triangle is visible, the thread writes out the triangle's three vertex indices and increments a total count of triangles to render.
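By way of illustration only, the per-triangle test and compaction described above may be sketched on the CPU as follows. The sketch is not taken from any particular implementation. It assumes screen-space vertex positions, a counter-clockwise front-face convention in a y-up coordinate system, and a conservative bounding-box test for the off-screen check. It omits occlusion testing, and a sequential loop stands in for the GPU threads. The names `Vec2`, `signedArea2`, `isTriangleVisible`, and `cullTriangles` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };  // hypothetical screen-space position

// Twice the signed area of triangle (a, b, c). Positive when the vertices
// wind counter-clockwise in a y-up coordinate system (assumed convention).
inline float signedArea2(Vec2 a, Vec2 b, Vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// Returns true if the triangle passes the front-facing, non-zero-area, and
// on-screen tests. Occlusion testing is deliberately left out of this sketch.
inline bool isTriangleVisible(Vec2 a, Vec2 b, Vec2 c, float width, float height) {
    const float area2 = signedArea2(a, b, c);
    if (area2 <= 0.0f) return false;  // back facing (< 0) or zero area (== 0)

    // Conservative off-screen test: reject only if the bounding box lies
    // entirely outside the viewport [0, width] x [0, height].
    const float minX = std::min({a.x, b.x, c.x});
    const float maxX = std::max({a.x, b.x, c.x});
    const float minY = std::min({a.y, b.y, c.y});
    const float maxY = std::max({a.y, b.y, c.y});
    if (maxX < 0.0f || minX > width || maxY < 0.0f || minY > height) return false;
    return true;
}

// Compacts the index list: each loop iteration plays the role of one GPU
// thread, appending the three indices of a visible triangle and incrementing
// the running count of triangles to render.
std::vector<std::uint32_t> cullTriangles(const std::vector<Vec2>& positions,
                                         const std::vector<std::uint32_t>& indices,
                                         float width, float height,
                                         std::uint32_t& visibleCount) {
    assert(indices.size() % 3 == 0);
    std::vector<std::uint32_t> out;
    visibleCount = 0;
    for (std::size_t t = 0; t + 2 < indices.size(); t += 3) {
        const std::uint32_t i0 = indices[t], i1 = indices[t + 1], i2 = indices[t + 2];
        if (isTriangleVisible(positions[i0], positions[i1], positions[i2], width, height)) {
            out.push_back(i0);
            out.push_back(i1);
            out.push_back(i2);
            ++visibleCount;
        }
    }
    return out;
}
```

On a GPU, the append and the increment would typically be performed by many threads concurrently, for example with an atomic counter that reserves an output slot, which is why the output order is not guaranteed to match the input order.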
The GPU's rendering pipeline requires either 32 bit or 16 bit vertex indices. When triangle pre-culling is not used, 16 bit indices are the most commonly used because of their compact storage. Compute shaders in the culling phase, however, are typically incapable of writing just 16 bits of memory. Therefore, if 16 bit indices are to be used, implementations pre-zero the output index buffer (which has a cost), and each thread then performs an atomic-OR of one vertex index into the high or low 16 bits of a 32 bit word that is shared with an index written by a different GPU thread. The thread may pack the other two vertex indices into a single 32 bit write. Each thread therefore has to work out whether its first or its last vertex index is the one that requires the atomic-OR, and pack the three indices into the correct 16 bit halves. Because of this complexity and the use of atomic operations, many implementations choose to use 32 bit indices, but this increases the memory cost and bandwidth requirement.
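For illustration, the 16 bit packing scheme described above may be sketched as follows. This is a CPU-side approximation using `std::atomic`, not shader code from any actual implementation. It assumes that the earlier index of each pair occupies the low 16 bits of its 32 bit word, and that each visible triangle has already been assigned an output slot number. The names `zeroIndexBuffer`, `writeTriangle16`, and `readIndex16` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using WordBuffer = std::vector<std::atomic<std::uint32_t>>;

// Pre-zero the output buffer so that atomic-OR can be used to merge halves.
// This pass is the extra cost mentioned in the text.
void zeroIndexBuffer(WordBuffer& words) {
    for (auto& w : words) w.store(0u, std::memory_order_relaxed);
}

// Writes the three 16 bit indices of the triangle in output slot `slot`.
// Three 16 bit indices span one and a half 32 bit words, so one index always
// shares a word with a neighbouring triangle and must be merged atomically.
void writeTriangle16(WordBuffer& words, std::uint32_t slot,
                     std::uint16_t i0, std::uint16_t i1, std::uint16_t i2) {
    const std::size_t first = static_cast<std::size_t>(slot) * 3;  // position of i0 in the 16 bit stream
    const std::size_t word = first / 2;
    assert(word + 1 < words.size());
    const std::uint32_t a = i0, b = i1, c = i2;
    if (first % 2 == 0) {
        // Even start: i0 and i1 fill a whole word owned by this triangle;
        // i2 lands in the low half of the next word, shared with the next slot.
        words[word].store(a | (b << 16), std::memory_order_relaxed);
        words[word + 1].fetch_or(c, std::memory_order_relaxed);
    } else {
        // Odd start: i0 lands in the high half of a word shared with the
        // previous slot; i1 and i2 fill the following word.
        words[word].fetch_or(a << 16, std::memory_order_relaxed);
        words[word + 1].store(b | (c << 16), std::memory_order_relaxed);
    }
}

// Reads back the k-th 16 bit index from the packed buffer.
std::uint16_t readIndex16(const WordBuffer& words, std::size_t k) {
    const std::uint32_t w = words[k / 2].load(std::memory_order_relaxed);
    return static_cast<std::uint16_t>((k % 2 == 0) ? (w & 0xFFFFu) : (w >> 16));
}
```

Because the shared half-words are merged with an atomic-OR into a zeroed buffer, the slots may be written in any order. The branch on the parity of the starting position is the per-thread bookkeeping that leads many implementations to prefer 32 bit indices.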
Once the culling process is complete, the GPU can move on to the traditional rendering phase. Due to non-determinism on the GPU, different groups of threads process different triangles at different times. This results in a re-ordering of the original triangle lists, which may negatively impact vertex cache hits during the subsequent GPU rendering phase.
Thus, there is a need in the art for improvements in visibility determinations.