Recent advances in computer performance have enabled graphic systems to provide more realistic graphical images using personal computers, home video game computers, handheld devices, and the like. In such graphic systems, a number of procedures are executed to “render” or draw graphic primitives to the screen of the system. A “graphic primitive” is a basic component of a graphic picture, such as a point, line, polygon, or the like. Rendered images are formed with combinations of these graphic primitives. Many procedures may be utilized to perform 3-D graphics rendering.
Specialized graphics processing units (GPUs) have been developed to optimize the computations required in executing the graphics rendering procedures. The GPUs are configured for high-speed operation and typically incorporate one or more rendering pipelines. Each pipeline includes a number of hardware-based functional units that are optimized for high-speed execution of graphics instructions/data. Generally, the instructions/data are fed into the front end of the pipeline and the computed results emerge at the back end of the pipeline. The hardware-based functional units, cache memories, firmware, and the like, of the GPU are optimized to operate on the low-level graphics primitives and produce real-time rendered 3-D images.
In modern real-time 3-D graphics rendering, the functional units of the GPU need to be programmed in order to properly execute many of the more refined pixel shading techniques. These techniques require, for example, the blending of colors into a pixel in accordance with factors in a rendered scene which affect the nature of its appearance to an observer. Such factors include, for example, fogginess, reflections, light sources, and the like. In general, several graphics rendering programs (e.g., small specialized programs that are executed by the functional units of the GPU) influence a given pixel's color in a 3-D scene. Such graphics rendering programs are commonly referred to as shader programs, or simply shaders. In more modern systems, some types of shaders (e.g., vertex shaders) can be used to alter the actual geometry of a 3-D scene and other primitive attributes.
In a typical GPU architecture, each of the GPU's functional units is associated with a low-level, low-latency internal memory (e.g., register set, etc.) for storing instructions that program the architecture for processing the primitives. The instructions typically comprise shader programs and the like. The instructions are loaded into their intended GPU functional units by propagating them through the pipeline. As the instructions pass through the pipeline and reach their intended functional unit, that functional unit recognizes its intended instructions and stores them within its internal registers.
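The loading scheme just described can be illustrated with a small Python model. This is a sketch under stated assumptions, not a description of any particular GPU: the `FunctionalUnit` class, the unit names, the `(unit_id, register, value)` packet layout, and the instruction strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionalUnit:
    """Hypothetical pipeline stage with a small internal register set."""
    unit_id: str
    registers: dict = field(default_factory=dict)

    def observe(self, packet):
        """Store the packet if it is addressed to this unit; return True if claimed."""
        target, register, value = packet
        if target == self.unit_id:
            self.registers[register] = value
            return True
        return False


def load_through_pipeline(pipeline, packets):
    """Feed each packet in at the front end; it travels until a unit claims it."""
    unclaimed = []
    for packet in packets:
        for unit in pipeline:
            if unit.observe(packet):
                break
        else:
            # No unit recognized the packet, so it falls out of the back end.
            unclaimed.append(packet)
    return unclaimed


# Assumed three-stage pipeline and made-up instruction strings.
pipeline = [FunctionalUnit("vertex"), FunctionalUnit("raster"), FunctionalUnit("shader")]
packets = [
    ("shader", 0, "MUL r0, r1, r2"),
    ("vertex", 0, "DP4 o0, v0, c0"),
    ("shader", 1, "ADD r0, r0, r3"),
]
leftover = load_through_pipeline(pipeline, packets)
```

In this model, each instruction must already carry the correct unit identifier and register index before it enters the pipeline, which is why the addressing inside a program matters when programs are loaded.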
Prior to being loaded into the GPU, the instructions are typically stored in system memory. Because of the much larger size of the system memory, a large number of graphics processing programs (e.g., shader programs, fragment programs, etc.) can be stored there. The programs can each be tailored to perform a specific task or accomplish a specific result. In this manner, the graphics processing programs stored in system memory act as a library, with each of a number of shader programs configured to accomplish a different specific function. For example, depending upon the specifics of a given 3-D rendering scene, specific shader programs can be chosen from the library and loaded into the GPU to accomplish a specialized, customized result.
The graphics processing programs, shader programs, and the like are transferred from system memory to the GPU through a DMA (direct memory access) operation. This allows the GPU to selectively pull in the specific programs it needs. The GPU can assemble an overall graphics processing program, shader, etc. by selecting two or more of the graphics programs in system memory and DMA transferring them into the GPU.
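As a rough illustration of selecting programs from a library in system memory and transferring them into the GPU, consider the following Python sketch. The `LIBRARY` contents, the `dma_transfer` and `assemble` helpers, and the flat list standing in for GPU instruction memory are assumptions made for the example; a real DMA engine is driven by hardware descriptors rather than a function call.

```python
# Hypothetical library in "system memory": program name -> instruction words.
LIBRARY = {
    "fog": ["FOG_0", "FOG_1"],
    "specular": ["SPEC_0", "SPEC_1", "SPEC_2"],
    "shadow": ["SHD_0"],
}


def dma_transfer(gpu_memory, base, words):
    """Model a DMA copy: write words into GPU memory starting at base."""
    end = base + len(words)
    if end > len(gpu_memory):
        raise MemoryError("program does not fit in GPU instruction memory")
    gpu_memory[base:end] = words
    return end


def assemble(selection, gpu_memory_size=16):
    """Copy the selected programs back to back and record where each one landed."""
    gpu_memory = [None] * gpu_memory_size
    placement = {}
    cursor = 0
    for name in selection:
        placement[name] = cursor
        cursor = dma_transfer(gpu_memory, cursor, LIBRARY[name])
    return gpu_memory, placement


gpu_memory, placement = assemble(["specular", "fog"])
# placement is {"specular": 0, "fog": 3}; reversing the selection gives
# {"fog": 0, "specular": 2}, so a program's base address depends on the combination.
```

The point of the sketch is that the same library program can land at a different base address each time, depending on which programs precede it.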
There are problems with conventional GPU architectures in selectively assembling more complex graphics programs, shader programs, or the like from multiple subprograms. In general, it is advantageous to link two or more graphics programs together in order to implement more complex or more feature-rich render processing. A problem exists, however, in that to link multiple graphics processing programs together, the addressing schemes of the programs need to properly refer to GPU memory such that the programs execute as intended. For example, in a case where two shader programs are linked to form a longer shader routine, the first shader address mechanism needs to correctly reference the second shader address mechanism. Additionally, both shader address mechanisms need to properly and coherently refer to the specific GPU functional units and/or registers in which they will be stored. This can involve considerable overhead in those cases where there are many different graphics programs stored in system memory and a given application wants to be able to link multiple programs in a number of different orders, combinations, total lengths, and the like.
The programs in system memory have no way of knowing the order in which they will be combined, how many of them there will be in any given combination, or whether they will be combined at all. Due to the real-time rendering requirements, the configurations of the combinations need to be determined on-the-fly, and need to be implemented as rapidly as possible in order to maintain acceptable frame rates. It is still desirable to DMA transfer the programs from the system memory to the GPU (e.g., on an as-needed basis). In order to facilitate DMA transfers, the desired programs need to be modified to properly point to their respective correct addresses and to properly order themselves for execution with the various functional units of the GPU. Unfortunately, this results in a large number of read-modify-write (R-M-W) operations, in which the programs must be read, their address mechanisms altered such that the individual instructions comprising each program correctly match their intended functional units and registers, and the programs written back to system memory. Only after the required R-M-W operations have been completed can the desired programs be DMA transferred into the GPU. This results in a large amount of undesirable processor overhead.
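The read-modify-write overhead can be made concrete with a small Python sketch. The instruction format (an opcode paired with a program-local address or `None`), the `relocate` and `link_with_rmw` helpers, and the staging area are hypothetical; the sketch assumes every address in a program is relative to that program's own start and must be shifted by the program's final base address before the DMA transfer.

```python
def relocate(program, base):
    """Return a copy of the program with every program-local address shifted by base."""
    return [(op, None if addr is None else addr + base) for op, addr in program]


def link_with_rmw(system_memory, names):
    """Patch each selected program in system memory, then return the DMA-ready image.

    rmw_count is the number of instruction words that were read, modified,
    and written back before the transfer could start.
    """
    staging = {}      # assumed staging area in system memory for patched copies
    image = []
    rmw_count = 0
    base = 0
    for name in names:
        original = system_memory[name]        # read
        patched = relocate(original, base)    # modify
        staging[name] = patched               # write back
        rmw_count += len(original)
        image.extend(patched)
        base += len(original)
    return image, rmw_count


# Two made-up programs whose addresses are relative to their own first word.
system_memory = {
    "A": [("MOV", 0), ("BRA", 1)],
    "B": [("MUL", 0), ("BRA", 0)],
}

image_ab, cost_ab = link_with_rmw(system_memory, ["A", "B"])
image_ba, cost_ba = link_with_rmw(system_memory, ["B", "A"])
```

In this model, every word of every selected program is touched by the processor for each new combination, so the patching cost grows with the total length of the linked programs and is paid again whenever the order or selection changes.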
The increased overhead is especially problematic for the ability of prior art 3-D rendering architectures to scale to handle the increasingly complex 3-D scenes of today's applications. Scenes now commonly contain hundreds of programs, each consisting of up to hundreds of instructions. Thus, a need exists for a program loading process that can scale as graphics application needs require and provide added performance without incurring penalties such as increased processor overhead.