The process of rendering two-dimensional images from three-dimensional scenes is commonly referred to as image processing. As the modern computer industry evolves, image processing evolves as well. One particular goal in the evolution of image processing is to make two-dimensional simulations or renditions of three-dimensional scenes as realistic as possible. One limitation of rendering realistic images is that modern monitors display images through the use of pixels.
A pixel is the smallest area of space which can be illuminated on a monitor. Most modern computer monitors will use a combination of hundreds of thousands or millions of pixels to compose the entire display or rendered scene. The individual pixels are arranged in a grid pattern and collectively cover the entire viewing area of the monitor. Each individual pixel may be illuminated to render a final picture for viewing.
One technique for rendering a real world three-dimensional scene onto a two-dimensional monitor using pixels is called rasterization. Rasterization is the process of taking a two-dimensional image represented in vector format (mathematical representations of geometric objects within a scene) and converting the image into individual pixels for display on the monitor. Other techniques for rendering a real world three-dimensional scene onto a two-dimensional monitor using pixels have been developed based upon more realistic physical modeling. One such physical rendering technique is called ray tracing, which traces the propagation of imaginary rays, which behave similarly to rays of light, into a three-dimensional scene that is to be rendered onto a computer screen. The rays originate from the eye(s) of a viewer sitting behind the computer screen and pass through the pixels that make up the computer screen toward the three-dimensional scene. Each traced ray proceeds into the scene and may intersect with objects within the scene. If a ray intersects an object within the scene, properties of the object and several other contributing factors are used to calculate the amount of color and light, or lack thereof, to which the ray is exposed. These calculations are then used to determine the final color of the pixel through which the traced ray passed.
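By way of illustration only, the ray tracing steps described above can be sketched in a few lines of Python. The sketch assumes a single sphere, an eye at the origin, a screen plane one unit in front of the eye, and a simple diffuse shading rule; the function names (ray_sphere_hit, render) and all scene values are hypothetical and are chosen only to make the example self-contained.

```python
import math


def ray_sphere_hit(origin, direction, center, radius):
    """Return the nearest positive distance along a unit-length ray to the sphere, or None."""
    oc = tuple(o - c for o, c in zip(origin, center))
    b = 2.0 * sum(d * o for d, o in zip(direction, oc))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - 4.0 * c  # quadratic coefficient a is 1 because direction is unit length
    if disc < 0.0:
        return None  # the ray misses the sphere
    t = (-b - math.sqrt(disc)) / 2.0
    return t if t > 0.0 else None


def render(width, height):
    """Trace one ray per pixel and return a grid of brightness values in [0, 1]."""
    eye = (0.0, 0.0, 0.0)                    # assumed viewer position
    center, radius = (0.0, 0.0, 3.0), 1.0    # assumed scene: one sphere
    to_light = (0.0, 0.0, -1.0)              # assumed unit vector from surface toward the light
    image = []
    for row in range(height):
        line = []
        for col in range(width):
            # Center of this pixel on the screen plane z = 1, spanning [-1, 1] in x and y.
            px = (col + 0.5) / width * 2.0 - 1.0
            py = 1.0 - (row + 0.5) / height * 2.0
            length = math.sqrt(px * px + py * py + 1.0)
            direction = (px / length, py / length, 1.0 / length)
            t = ray_sphere_hit(eye, direction, center, radius)
            if t is None:
                line.append(0.0)  # no intersection: background
            else:
                hit = tuple(e + t * d for e, d in zip(eye, direction))
                normal = tuple((h - c) / radius for h, c in zip(hit, center))
                # Diffuse term: brightness depends on the angle between normal and light.
                line.append(max(0.0, sum(n * l for n, l in zip(normal, to_light))))
        image.append(line)
    return image
```

Calling render(5, 5) returns a 5 by 5 grid in which pixels whose rays strike the sphere carry a nonzero brightness and all other pixels remain at the background value. A practical ray tracer would additionally account for multiple objects, shadows, reflection, and refraction, none of which is modeled here.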
As image resolution and complexity continue to increase, the computational requirements of an image processing system likewise continue to increase. With continued improvements in semiconductor technology in terms of clock speed and an increased use of parallelism, however, rasterization becomes viable for more complex images, and real-time rendering of scenes using physical rendering techniques such as ray tracing becomes a more practical alternative to rasterization. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Hardware-based pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
Irrespective of whether raster-based or physical rendering is performed to render image data for a scene, the increased use of parallelism presents some challenges with respect to maintaining a coherent state in a parallelized, multithreaded architecture. In many conventional multithreaded environments, for example, state data that is shared by multiple hardware-based threads, or threads of execution (as distinguished from time-sliced, software-based multithreading), is typically stored in a shared memory that is accessible by all of the threads of execution. The shared memory, for example, may be implemented using an on-chip DRAM array or using memory devices that are external to any processor chips.
In addition, caching may be used to accelerate the access to the shared state. With caching, one or more levels of smaller, yet faster memory arrays are interposed between the threads of execution and the shared memory to temporarily store copies of data in the shared memory, thereby accelerating the retrieval of data by threads of execution. Some cache memories may be shared by multiple threads of execution, while others, which often offer the lowest latency, may be tightly integrated with and exclusively owned by particular threads of execution.
In conventional caching environments, whenever a thread of execution attempts to access shared data stored in a shared memory, the shared data is copied into one or more levels of cache memory so that subsequent accesses to the data are made to the cache memory rather than the shared memory. So long as the data is not modified by any thread of execution, multiple copies of the data can be cached by multiple threads of execution. Should the data be modified by any particular thread of execution, a coherence protocol, typically based on either a coherence directory or snooping, is used to invalidate other copies of the data in other threads of execution. When the other threads attempt to access the data again, the modified data is written back to the shared memory, and in some instances, sent directly from the prior owner of the data to a requesting thread through a process known as intervention.
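The invalidate-on-write behavior described above can be illustrated with a toy model. The following Python sketch is a simplified, hypothetical simulation and not a description of any particular hardware protocol: the class name Directory, its fields, and the single-level private caches are assumptions made for illustration, and intervention (direct cache-to-cache transfer) is deliberately not modeled.

```python
class Directory:
    """Toy directory-based, invalidate-on-write coherence over one shared memory."""

    def __init__(self, num_threads, initial_memory):
        self.memory = dict(initial_memory)                   # shared memory: address -> value
        self.caches = [dict() for _ in range(num_threads)]   # one private cache per thread
        self.owner = {}                                      # address -> thread holding a modified copy
        self.memory_reads = 0                                # fetches that reached shared memory

    def read(self, tid, addr):
        cache = self.caches[tid]
        if addr in cache:
            return cache[addr]  # cache hit: shared memory is not touched
        owner = self.owner.pop(addr, None)
        if owner is not None:
            # A modified copy exists elsewhere: write it back before fetching.
            self.memory[addr] = self.caches[owner][addr]
        self.memory_reads += 1
        cache[addr] = self.memory[addr]
        return cache[addr]

    def write(self, tid, addr, value):
        # Invalidate every other cached copy, then record the writer as owner.
        for other, cache in enumerate(self.caches):
            if other != tid:
                cache.pop(addr, None)
        self.caches[tid][addr] = value
        self.owner[addr] = tid


d = Directory(num_threads=4, initial_memory={"state": 1})
for tid in range(4):
    d.read(tid, "state")       # four misses, one per thread
for tid in range(4):
    d.read(tid, "state")       # four hits, no memory traffic
d.write(0, "state", 2)         # invalidates the copies held by threads 1 to 3
values = [d.read(tid, "state") for tid in range(4)]
print(values, d.memory_reads)
```

In this run, a single write by one thread forces each of the other three threads to fetch the data from the shared memory again, which is the source of the traffic discussed in the following paragraph.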
In highly multithreaded environments incorporating a shared memory, however, propagating changes to state data can be highly inefficient and can significantly reduce throughput. In many such environments, the interface to the shared memory has limited bandwidth. Because a high number of threads of execution may need to use shared data, any change to that data may result in tens or hundreds of threads attempting to access the same data at the same time, which can cause the interface with the shared memory to become a significant bottleneck. In some cases, the interface is further overloaded with coherency-related communications as those tens or hundreds of threads attempt to maintain coherency with one another. In addition, shared state data in some instances can be somewhat large, e.g., on the order of several kilobytes of memory, so forwarding complete copies of shared state data can also have a significant adverse impact on communications and memory bandwidth.
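To give a rough sense of scale, the traffic generated when every thread re-reads a complete copy of the shared state after an update can be estimated as the product of the thread count, the state size, and the number of updates. The Python sketch below uses illustrative values only: 100 threads and 4 KiB of state are assumptions standing in for "hundreds of threads" and "several kilobytes", and the function name refetch_traffic_bytes is hypothetical. Coherency-related messages are not included in the estimate.

```python
def refetch_traffic_bytes(num_threads, state_bytes, num_updates=1):
    """Bytes crossing the shared-memory interface if every thread re-reads
    the full shared state after each update (coherency messages excluded)."""
    return num_threads * state_bytes * num_updates


assumed_threads = 100            # assumption: "hundreds of threads" taken as 100
assumed_state_bytes = 4 * 1024   # assumption: "several kilobytes" taken as 4 KiB

print(refetch_traffic_bytes(assumed_threads, assumed_state_bytes))  # 409600
```

Under these assumed values, a single update to the shared state results in roughly 400 KiB of data crossing the shared memory interface, and the figure grows linearly with the number of updates.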
Similar problems may also exist in other highly multithreaded environments, including those used in applications other than image processing. A need therefore exists in the art for an improved manner of maintaining coherent state data in highly multithreaded environments.