In the field of computer graphics, rendering refers to the process of adding realism to an image by giving objects displayed on a two-dimensional display three-dimensional (3D) qualities, such as shadowing, color variation, and shading. Modern 3D graphics processors are commonly used to render 3D graphical images for display. Rendering is typically accomplished by breaking the objects up into a series of primitives such as polygons, typically triangles. For each pixel that makes up a polygon, values are assigned for attributes such as color, specular color, alpha (transparency), fog, surface texture, and depth. The attribute values are then combined, or otherwise processed, by the graphics processor to achieve a precise pixel value for each pixel that makes up the object. Rendering a realistic image often requires several processing operations, involving a variety of attributes, to be performed for each individual pixel that makes up the object or image.
A graphics processor is generally limited by the clock speed at which it can process the individual pixels of a 3D computer image. One way in which system designers have improved the efficiency of graphics processors is by designing them to perform multiple pixel processing operations at the same time. For example, system designers have developed multithreading, multi-core graphics processors. As the name suggests, a multithreading, multi-core graphics processor has multiple cores, or pixel processing units, that operate on pixels in parallel. Each core of the engine is directed to process pixels by a stream of instructions referred to as a thread. One of the advantages of a multithreading, multi-core approach to pixel processing is that the graphics processor can switch between threads if, for example, one thread is required to wait for pixel data to be fetched from main memory.
To further improve efficiency, many graphics processors have been designed with an on-chip cache to store pixel data that is being processed. In particular, if the on-chip cache contains the pixel data that is required for processing, the processing occurs more quickly because the graphics processor need not wait for the pixel data to be fetched from main memory every time it is required for a pixel processing operation. The reduced number of main memory accesses improves the overall efficiency of the graphics processor.
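The effect of such a cache on main memory traffic can be illustrated with a minimal sketch. The names used here (CountingMemory, process_without_cache, process_with_cache) are hypothetical, and the cache is modeled as an unbounded dictionary keyed by pixel coordinates, which is an assumption made for brevity rather than a description of any particular hardware.

```python
class CountingMemory:
    """Hypothetical main memory that counts how often it is accessed."""

    def __init__(self, data):
        self.data = dict(data)
        self.fetches = 0

    def fetch(self, addr):
        self.fetches += 1
        return self.data[addr]


def process_without_cache(memory, accesses):
    # Every pixel access goes to main memory.
    for addr in accesses:
        memory.fetch(addr)
    return memory.fetches


def process_with_cache(memory, accesses):
    # Assumption: an unbounded on-chip cache keyed by (x, y).
    cache = {}
    for addr in accesses:
        if addr not in cache:          # miss: fetch once from main memory
            cache[addr] = memory.fetch(addr)
        _ = cache[addr]                # hit: served from the cache
    return memory.fetches


pixels = {(0, 0): 10, (1, 0): 20}
accesses = [(0, 0), (0, 0), (1, 0), (0, 0), (1, 0)]

uncached_fetches = process_without_cache(CountingMemory(pixels), accesses)
cached_fetches = process_with_cache(CountingMemory(pixels), accesses)
```

In this sketch, five pixel accesses that touch only two distinct pixels require five main memory fetches without the cache and two with it.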
However, implementing a caching technique with a multithreading, multi-core graphics processor poses several challenges. Due to thread-switching in a multithreading, multi-core graphics engine, the order in which pixel processing operations are completed may be different from the order in which the threads are dispatched to the multi-core engine. For example, if two threads are dispatched to the multi-core engine to perform processing on the same pixel (e.g., each thread specifies the same x,y coordinates, representing the same pixel), then, due to thread-switching, the pixel processing operations are not guaranteed to be performed in the order in which the threads were dispatched. Because the result of a later operation may depend on the pixel value produced by an earlier one, the resulting pixel data may be incorrect if the operations are performed out of order.
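The ordering hazard can be illustrated with a minimal sketch in which two operations blend a color channel into the same pixel. The conventional "over" blend is used purely as an example of an order-dependent operation; the function names and the specific values are hypothetical.

```python
def blend_over(dst, src, alpha):
    """Blend one color channel of src over dst; all values in [0.0, 1.0]."""
    return src * alpha + dst * (1.0 - alpha)


def run(ops, initial=0.0):
    """Apply (src, alpha) operations to a single pixel in the order given."""
    value = initial
    for src, alpha in ops:
        value = blend_over(value, src, alpha)
    return value


# Two hypothetical threads targeting the same (x, y) pixel.
thread_a = (1.0, 0.5)   # dispatched first
thread_b = (0.0, 0.5)   # dispatched second

in_order = run([thread_a, thread_b])       # dispatch order preserved
out_of_order = run([thread_b, thread_a])   # completed in the opposite order
```

With these values, completing the operations in dispatch order leaves the pixel at 0.25, while completing them in the opposite order leaves it at 0.5, so the final pixel value depends on the completion order.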
Another problem is maintaining pixel data coherency between the cache and main memory. For example, after a pixel processing operation is performed and the resulting pixel value is written to the cache, the value for that particular pixel, as stored in the cache, will be different from the corresponding value stored in main memory. Consequently, until the cached value is written back, a read directly from main memory returns a stale, and therefore improper, pixel value.
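The coherency problem can be sketched as follows, assuming a write-back policy in which results are held in the cache and copied to main memory only on an explicit flush. The class WriteBackCache and its methods are hypothetical and are intended only to illustrate the stale read.

```python
class WriteBackCache:
    """Hypothetical write-back cache in front of a dict-based main memory."""

    def __init__(self, memory):
        self.memory = memory   # main memory: {(x, y): pixel value}
        self.lines = {}        # cached pixel values
        self.dirty = set()     # addresses modified since the last flush

    def write(self, addr, value):
        # Assumption: writes update only the cache, not main memory.
        self.lines[addr] = value
        self.dirty.add(addr)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def flush(self):
        # Copy every dirty value back so main memory is coherent again.
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()


main_memory = {(3, 7): 10}
cache = WriteBackCache(main_memory)

cache.write((3, 7), 42)                  # result of a pixel operation
stale_value = main_memory[(3, 7)]        # direct main memory read: still 10
cached_value = cache.read((3, 7))        # read through the cache: 42

cache.flush()
coherent_value = main_memory[(3, 7)]     # main memory now holds 42
```

Until the flush, a reader that bypasses the cache sees the old value 10 rather than the updated value 42.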
One solution to these problems is to implement two separate cache controllers, for example, one cache controller in the graphics engine to maintain data coherency between the render cache and main memory, and a separate controller to maintain the order in which threads are dispatched to the cores. However, a problem with this prior art solution is the overall size of the logic required to implement it. For example, because each cache controller has its own content addressable memory, the overall size of the logic (e.g., the number of gates required to implement the solution) is larger than desired.