1. Field of the Invention
The present invention generally relates to computer hardware and, more specifically, to distributed clip, cull, viewport transform, and perspective correction.
2. Description of the Related Art
The processing power of a modern central processing unit (CPU) may be supplemented using a co-processor, such as a graphics processing unit (GPU). Oftentimes, the GPU processes graphics data in order to generate pixels that are displayed on a screen.
Conventional GPUs process graphics data using a sequence of components, each of which implements a different processing stage. The sequence of components is known as a “graphics processing pipeline.” A portion of a prior art graphics processing pipeline 10 is illustrated in FIG. 1. As shown, graphics processing pipeline 10 includes a host 12, a front end 14, a data assembler 16, texture processing clusters (TPCs) 18-1 through 18-n, a data bus 20, a viewport culling engine (VPC) 22, a data bus 24, and a setup 26.
Data assembler 16 receives commands and indices from front end 14. The indices point to graphics primitives stored in memory, while the commands specify graphics processing operations to be performed with those primitives. Data assembler 16 fetches the graphics primitives from memory and generates different “batches,” where each batch includes a subset of the graphics primitives. Data assembler 16 then distributes one or more batches to each TPC 18.
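By way of illustration only, the batching and distribution performed by data assembler 16 may be sketched as follows. The fixed batch size, the round-robin assignment, and the names `make_batches` and `distribute` are assumptions introduced for this sketch; the description above does not specify how batches are sized or which TPC 18 receives a given batch.

```python
from typing import List, Sequence


def make_batches(primitives: Sequence[int], batch_size: int) -> List[List[int]]:
    """Split primitives, kept in fetch order, into consecutive batches.

    A fixed batch_size is an assumption made for this sketch.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        list(primitives[start:start + batch_size])
        for start in range(0, len(primitives), batch_size)
    ]


def distribute(batches: List[List[int]], num_tpcs: int) -> List[List[List[int]]]:
    """Assign batches to TPCs in round-robin order.

    Round-robin is a hypothetical policy; any policy that gives each TPC
    one or more batches would fit the description equally well.
    """
    if num_tpcs <= 0:
        raise ValueError("num_tpcs must be positive")
    assignment: List[List[List[int]]] = [[] for _ in range(num_tpcs)]
    for index, batch in enumerate(batches):
        assignment[index % num_tpcs].append(batch)
    return assignment


if __name__ == "__main__":
    batches = make_batches(range(7), batch_size=3)
    print(distribute(batches, num_tpcs=2))
```

Here each integer stands in for one graphics primitive, so that the fetch order remains visible after the batches have been handed to the TPCs.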
TPCs 18-1 through 18-n each colorize pixels associated with the graphics primitives in a batch received from data assembler 16. TPCs 18-1 through 18-n then transmit the graphics primitives, along with the colorized pixels, to VPC 22 through data bus 20. VPC 22 accumulates all of the graphics primitives and all of the colorized pixels from TPCs 18-1 through 18-n and then performs clipping, culling, viewport transform, and perspective correction for all of the received graphics primitives. VPC 22 then sends the processed primitives to setup 26 via data bus 24, and setup 26 prepares the primitives to be rasterized.
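For readers unfamiliar with the operations named above, the following is a minimal sketch of a trivial cull test, a perspective divide, and a viewport transform as they are commonly formulated. It is not a description of how VPC 22 is implemented. The clip volume convention (-w <= x, y, z <= w), the depth range of 0 to 1, and all function names are assumptions made for illustration.

```python
from typing import Sequence, Tuple

Vec4 = Tuple[float, float, float, float]
Vec3 = Tuple[float, float, float]


def trivially_culled(clip_vertices: Sequence[Vec4]) -> bool:
    """Return True when every vertex lies beyond the same clip plane.

    Assumes the clip volume -w <= x, y, z <= w; such a primitive cannot
    contribute any visible pixels and may be discarded.
    """
    for axis in range(3):
        if all(v[axis] > v[3] for v in clip_vertices):
            return True
        if all(v[axis] < -v[3] for v in clip_vertices):
            return True
    return False


def perspective_divide(clip: Vec4) -> Vec3:
    """Convert clip-space coordinates to normalized device coordinates."""
    x, y, z, w = clip
    if w == 0:
        raise ValueError("w must be non-zero; clip the primitive first")
    return (x / w, y / w, z / w)


def viewport_transform(ndc: Vec3, vx: float, vy: float,
                       width: float, height: float) -> Vec3:
    """Map normalized device coordinates in [-1, 1] to window coordinates.

    Depth is mapped to [0, 1], which is one common convention.
    """
    x, y, z = ndc
    return (vx + (x + 1.0) * 0.5 * width,
            vy + (y + 1.0) * 0.5 * height,
            (z + 1.0) * 0.5)


if __name__ == "__main__":
    triangle = [(0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 2.0), (0.0, 1.0, 0.0, 2.0)]
    if not trivially_culled(triangle):
        for vertex in triangle:
            print(viewport_transform(perspective_divide(vertex), 0, 0, 640, 480))
```

Primitives that are neither fully inside nor trivially culled would additionally require clipping against the view volume, which this sketch omits.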
One drawback of the architecture described above is that VPC 22 processes graphics primitives in the order in which data assembler 16 fetches those primitives. Consequently, when a given TPC 18 has not finished processing a particular graphics primitive, VPC 22 cannot process any graphics primitive that was fetched after the graphics primitive still being processed by that TPC 18, thereby causing a processing bottleneck. Another drawback of this architecture is that data bus 20 must transport all of the graphics primitives and colorized pixels to VPC 22. The architecture therefore does not scale, because increasing the number of TPCs 18 could require data bus 20 to be increased to an impractical size. A third drawback is that data bus 24 must transport all of the processed graphics primitives from VPC 22 to setup 26. As with data bus 20, increasing the number of TPCs 18 could require data bus 24 to be increased to an impractical size.
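The ordering bottleneck described above may be illustrated with a simple timing model. The sketch below assumes that each graphics primitive, listed in fetch order, becomes available from its TPC 18 at a given time, and that an in-order consumer can accept a primitive only once that primitive and every earlier-fetched primitive are available. The completion times and function names are hypothetical and are chosen only to show the effect.

```python
from itertools import accumulate
from typing import List, Sequence


def in_order_release_times(completion_times: Sequence[int]) -> List[int]:
    """Earliest time each primitive can be accepted by an in-order consumer.

    Primitive i cannot be accepted before every primitive fetched ahead of
    it has completed, so its release time is the running maximum of the
    completion times up to and including i.
    """
    return list(accumulate(completion_times, max))


def total_stall(completion_times: Sequence[int]) -> int:
    """Total time finished primitives spend waiting on earlier ones."""
    releases = in_order_release_times(completion_times)
    return sum(release - done for release, done in zip(releases, completion_times))


if __name__ == "__main__":
    # Hypothetical completion times, in fetch order; the second primitive is slow.
    times = [1, 9, 2, 3]
    print(in_order_release_times(times))  # [1, 9, 9, 9]
    print(total_stall(times))             # 13
```

In this example the third and fourth primitives are finished at times 2 and 3 but cannot be accepted until time 9, because the second primitive is still being processed.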
Accordingly, there remains a need in the art for a more efficient and scalable way to perform clipping, culling, viewport transform, and perspective correction.