For the vast majority of applications, application programmers rely on or utilize some form of software interface for interactions between a host system, such as the host system of a computer, and its associated subsystems, such as a computer's graphics subsystem. For graphics applications, developers or programmers typically utilize a graphics software interface, such as a 3D graphics application programming interface (API), to facilitate the interaction with constituent parts of a graphics system. For instance, a developer might develop a graphics application that makes and receives calls to and from the graphics API in order to achieve some result pertaining to a graphics effect applied to graphics data. Programmers typically rely on software interfaces to graphics processing units (GPUs), peripherals and other specialized devices so that they can focus on the operational specifics of their application and the artistry of the graphics content rather than on the specifics of controlling a particular device or the algorithmic details associated with generating certain graphics objects or transforming those objects according to a particular effect. Programmers also rely on software interfaces so that their efforts are not duplicated from application to application, i.e., so that function calls or interfaces which are likely to be useful to multiple developers or likely to be applicable to various graphics scenarios, such as “Create Triangle,” “Fill in Object with a Specified Solid Color,” “Stretch/Scale Rectangle,” etc., can be re-used. However, even after generations of software interfaces, there are certain aspects of today's software interfaces that can be improved.
Historically, graphics peripherals, integrated circuits (ICs) and other specialized graphics hardware designed for specific tasks, e.g., special purpose co-processing chips such as GPUs, have been better than the host processor of a host computing system at performing certain types of functions. For instance, video cards generally include special purpose hardware for copying and processing pixels and vertices faster than the central processing unit (CPU). So, historically, for a PC having a host system with a CPU and a graphics subsystem having a GPU, when any sort of graphics “thinking” was involved, the CPU handled the processing and when repetitive number crunching of large arrays of data was implicated, the GPU was called upon for processing. However, changes in graphics technology have occurred that have transformed the traditionally fixed function graphics pipeline into a more flexible entity.
For instance, hereby incorporated by reference, commonly assigned copending U.S. patent application Ser. No. 09/796,577, filed Mar. 1, 2001, entitled “Method and System for Defining and Controlling Algorithmic Elements in a Graphics Display System,” relates to systems and methods for enabling programmability of a 3D graphics chip, wherein programming or algorithmic elements written by the developer can be downloaded to the chip, thereby programming the chip to perform those algorithms. As described, a developer writes a routine representing algorithmic element(s), wherein the routine is downloadable to the 3D graphics chip and then downloads the algorithmic element(s) to the programmable chip. Alternatively, the developer chooses from a pre-existing set of algorithmic elements that are provided in connection with the API itself, or specifies the location of an otherwise existing routine. The routine adheres to a specific format for packing up the algorithmic element(s), or instructions, for implementation by the 3D graphics chip. In one embodiment, the developer packs the instruction set into an array of numbers, by referring to a list of ‘tokens’ understood by the 3D graphics chip. This array of numbers in turn is mapped correctly to the 3D graphics chip for implementation of the algorithmic element(s) by the 3D graphics chip. The architecture of the '577 application enables the developer to be flexible when defining the computation to be performed by the chip, while simultaneously allowing the developer to leverage the power and performance advantages provided by the 3D graphics chip.
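As a rough illustration of packing an instruction set into an array of numbers by reference to a list of tokens, the following Python sketch may be helpful. The opcode values, the register bank encodings and the pack_program helper are hypothetical and chosen only for illustration; they are not the token format of the '577 application or of any particular 3D graphics chip.

```python
# Hypothetical token values; a real chip defines its own list of tokens.
OPCODES = {"mov": 1, "add": 2, "mul": 3}
REGISTER_BANKS = {"r": 0x100, "c": 0x200, "v": 0x300, "o": 0x400}


def encode_operand(name):
    """Encode an operand such as 'r0' or 'c3' as bank code plus index."""
    bank, index = name[0], int(name[1:])
    return REGISTER_BANKS[bank] + index


def pack_program(lines):
    """Pack textual instructions into a flat list of integer tokens."""
    tokens = []
    for line in lines:
        opcode, _, operand_text = line.strip().partition(" ")
        tokens.append(OPCODES[opcode])
        for operand in operand_text.split(","):
            tokens.append(encode_operand(operand.strip()))
    return tokens


program = ["mov r0, v0", "mul r1, r0, c3", "add o0, r1, c4"]
packed = pack_program(program)
```

In this sketch each instruction becomes one opcode token followed by one token per operand, so the three-instruction program above packs into eleven numbers.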
Vertex and pixel shaders, which may be implemented with software or hardware or with a combination of both, are specialized components of a graphics subsystem that include specialized functionality for the processing of pixels, vertices, or other graphics data, so as to perform specialized operations, such as lighting and shading, and other transformations upon graphics data. In this regard, vertex and pixel shaders are two types of procedural shaders that have evolved to possess programmable functionality, e.g., as described in the '577 application.
Additional background relating to vertex and pixel shaders can be found in commonly assigned copending U.S. patent application Ser. No. 09/801,079, filed Mar. 6, 2001, entitled “API Communications for Vertex and Pixel Shaders,” hereby incorporated into the present disclosure by reference. Briefly, the '079 application is directed to a three dimensional (3-D) graphics application programming interface (API) that provides improved communications between application developers and hardware rendering devices, such as procedural shaders. In particular, the '079 application is directed to improved API communications for host interaction with procedural shaders, such as vertex and pixel shaders, having local registers. The API communications of the '079 application advantageously expose various on-chip graphical algorithmic elements, while hiding the details of the operation of vertex shaders and pixel shaders from the developer. Advantageously, the procedural shaders and corresponding communications do not access the main memory or stack on the host system, but rather perform their operations efficiently with respect to a set of local registers. For the particular graphical algorithmic elements exposed, the graphics subsystem and corresponding interfaces of the '079 application allow for an efficient instruction set with numerous performance advantages, including faster accessing and processing of data as a result of bypassing the host system memory or stack.
As is apparent from the above, advances in hardware, such as procedural shaders, and graphics interfaces and algorithms have been revolutionizing the way graphics platforms operate. Generally speaking, however, current 3D graphics chips on the market can still be made more flexible and efficient, i.e., room for improvement still exists, both with respect to vertex shaders and pixel shaders.
For instance, on the vertex shader side of the graphics pipeline, while programs, i.e., algorithmic element(s) packaged as tokenized set(s) of instructions, currently can be downloaded to a graphics chip, the flow represented by a program performed by the graphics chip must be static. While such static flow may include branches, the branches themselves are fixed and may not be predicated upon a characteristic only known at runtime, i.e., any branches that may currently exist in a program downloaded to a vertex shader are predicated upon pre-set constants, such that all data fed to the vertex shader is processed in exactly the same way until the corresponding program is unloaded.
For instance, as illustrated in FIG. 1A, a developer D (or a software application A) can specify a program P having exemplary instructions I1 to I5 to a graphics API GAPI for download to the graphics chip, such as vertex shader VS, in order to program the graphics chip to perform the algorithms represented by the program P. Once the program P is downloaded to vertex shader VS, however, i.e., once vertex shader VS is programmed with program P, graphics data processed by vertex shader VS must be processed according to the algorithms I1′ to I5′ represented by or corresponding to instructions I1 to I5. In this simple example, the goal of the illustrated program P is to process black pixels in one way (I1, I2 and I3), and white pixels in another way (I4 and I5). However, once the program is loaded into vertex shader VS, no branching can take place upon a characteristic or variable of the runtime system, which can be limiting.
While the ability to define a static process flow for all graphics data to be processed according to algorithms I1′ to I5′ on the graphics chip is beneficial, currently, the static definition must remain for the duration of processing according to program P, i.e., until the processing is stopped and another program providing a different static computational flow is downloaded to the vertex shader VS for further processing of graphics data.
As illustrated in FIG. 1B, represented by the arrows illustrating the computational flow process performed on the graphics data, the processing that occurs for each data point of the graphics data streamed through the graphics chip's execution engine is limited to the static flow of the following: if a constant C1 is “0”, vertex shader VS processes according to algorithm, instruction, or function I1′, followed by algorithm I2′ and followed by algorithm I3′ before being output; and if a constant C1 is “1”, vertex shader VS processes according to algorithm I4′ and algorithm I5′ before being output. In this regard, all of the graphics data must be processed in these static rules of process flow, i.e., some of the data cannot be processed according to different rules of process flow. More particularly, the graphics data cannot currently be processed according to dynamic branches of program P determined at runtime, e.g., an “If Then” or “If Then Else” command or structure based upon a runtime condition cannot be deployed in a program. Accordingly, it would be desirable to provide dynamic flow control for programs that are downloaded to a vertex shader VS, whereby a coprocessor can receive a program which thereby programs the coprocessor to dynamically process data in a particular way defined by the program, and wherein the coprocessor can process data differently according to different branches defined by the program. For instance, according to criteria specified in the program, it would be desirable to process some of the data streaming through the coprocessor according to a first algorithm depending upon a first condition that is set or discovered at runtime, and some of the data according to a second algorithm depending upon a second runtime condition or setting without recourse to downloading another program. 
It would be further desirable to enable branching to occur dynamically during the execution of a program that has been downloaded to a graphics chip to predicate control of the processing of graphics data on runtime characteristics or variables.
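The distinction between the static flow of FIG. 1B and the desired dynamic flow can be sketched in Python as follows. The arithmetic inside path_a and path_b, the threshold test and all function names are assumptions made purely for illustration; they merely stand in for algorithms I1′ to I5′ and for some condition discovered at runtime.

```python
def path_a(value):
    """Stands in for algorithms I1', I2' and I3' applied in order."""
    return ((value + 1.0) * 2.0) - 0.5


def path_b(value):
    """Stands in for algorithms I4' and I5' applied in order."""
    return (value * 0.5) + 3.0


def run_static(stream, c1):
    """Static flow: pre-set constant c1 picks one path for the whole stream."""
    chosen = path_a if c1 == 0 else path_b
    return [chosen(value) for value in stream]


def run_dynamic(stream, threshold):
    """Dynamic flow: each data point selects its own path at runtime."""
    return [path_a(v) if v < threshold else path_b(v) for v in stream]


stream = [0.0, 1.0, 4.0]
static_out = run_static(stream, c1=0)
dynamic_out = run_dynamic(stream, threshold=2.0)
```

With the static flow, every data point follows the same path until the constant is changed; with the dynamic flow, the third data point takes the second path without another program being downloaded.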
It is to be noted that the dichotomy of symbolic representation, e.g., I1 v. I1′, is used above when describing a program instruction versus its functional representation as a part of a program that has been downloaded to a graphics chip, respectively; however, one can appreciate that a program may be parsed and/or partially, quasi- or fully tokenized or compiled en route to the graphics chip as part of the download process to format the program for reception and use by the graphics chip. As a consequence, the process of tracing or finding definitive correspondence between a representation I1′ in the graphics chip and a source code instruction I1 may be amorphous. Moreover, where one instruction “ends” and another “begins” is not necessarily definitive: atomicity of operation may be defined in different ways, and programs operate according to functional objectives, which can be divided into subsets of functional objectives, which can be divided into even smaller subsets of functional objectives, and so on. Thus, such symbolism for instructions has been used herein for conceptual or illustrative purposes.
FIGS. 2A and 2B collectively illustrate another point with respect to currently existing architectures that provide the ability to download a program, or algorithmic elements, to a programmable vertex shader in a graphics coprocessing subsystem. FIG. 2A illustrates a current architecture of a graphics API GAPI. FIG. 2A illustrates that graphics API GAPI, such as a 3D graphics API, generally includes many different interfaces for correspondingly different purposes. For instance, as illustrated, graphics API GAPI includes program download object(s) or interface(s) DO for use in connection with (A) specifying program(s) to be downloaded to the vertex shader VS, e.g., by a developer or an executing software application A, (B) partially or wholly parsing and/or partially or wholly tokenizing and/or compiling the instructions of the specified program(s), taking into account whether or not the specified program(s) have already been parsed, tokenized, compiled, etc., and (C) transmitting the program(s) to the graphics coprocessing subsystem in a format for the vertex shader VS.
Graphics API GAPI also includes, however, many other objects and interfaces, such as external object(s) or interface(s) EO, which may be used in connection with, inter alia, initializing, setting or changing various storage elements, such as registers, located in the graphics coprocessing subsystem, e.g., in the vertex shader VS. Thus, as illustrated in FIG. 2B, an exemplary vertex shader VS includes at least (1) a storage bank for n constants C[0] to C[n−1], which are immutable (read only) during operation of the vertex shader VS, (2) a plurality of readable/writable input register storage elements I1 to Ik (e.g., for vertices, intermediate programming results, etc.) and (3) a plurality of readable/writable output register storage elements O1 to Om. Exemplary vertex shader VS may include other register storage elements for storing other kinds of variables and constants as well, whether readable and/or writable.
Because of how quickly the above described storage elements can be accessed by the execution engine EE of the vertex shader VS, a program loaded into vertex shader VS via download object(s) DO can also execute upon large quantities of data streamed through the execution engine EE very quickly. Constants C[0] to C[n−1] may be first set by the external objects EO in order to define the context into which program(s) are to be downloaded, and constants C[0] to C[n−1] can also be declared globally at the loading or instantiation of a program in the vertex shader VS for reference during operation of the program, although constants may not be altered or reset during operation of the program, e.g., while the execution engine EE processes a stream, or container, of graphics data. In this regard, as implied by the notation, C[0] to C[n−1], constants are capable of being referenced by index with programming commands. An exemplary command that indexes a constant is the command “mov r0, C[3],” which when executed moves the value stored in constant storage location C[3] into register r0. However, presently, no readable and writable storage element in vertex shader VS may be referenced by index, i.e., a “mov r0, I2” or a “mov r0, O7” command can be executed, but the equivalent “mov r0, I[1]” or “mov r0, O[6]” commands using an index into the array of input and output registers cannot be performed. The registers I1 to Ik and O1 to Om are individually and independently addressable only. Thus, a program cannot currently index readable/writable input and output registers of a vertex shader VS. Such indexing would be particularly desirable, and would provide more vertex shader flexibility, for a variety of reasons, including, but not limited to, achieving looping or recursive behavior within a program downloaded to the vertex shader VS.
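To suggest why indexing of registers bears on looping behavior, the following Python sketch contrasts registers that are addressable only by fixed name with a register bank addressable by index. The NamedRegisters class and both summing functions are hypothetical illustrations and do not model any actual vertex shader instruction set.

```python
class NamedRegisters:
    """Registers addressable only by a fixed name, e.g. 'I1', 'I2'."""

    def __init__(self, values):
        self._slots = {f"I{i + 1}": v for i, v in enumerate(values)}

    def read(self, name):
        return self._slots[name]


def sum_unrolled(regs):
    """Without indexing, each register read is its own instruction."""
    total = regs.read("I1")
    total += regs.read("I2")
    total += regs.read("I3")
    total += regs.read("I4")
    return total


def sum_indexed(bank):
    """With indexing, one loop body serves any number of registers."""
    total = 0.0
    for index in range(len(bank)):
        total += bank[index]  # analogous to reading I[index]
    return total


values = [1.0, 2.0, 3.0, 4.0]
unrolled = sum_unrolled(NamedRegisters(values))
indexed = sum_indexed(values)
```

Both functions compute the same result, but only the indexed form keeps its instruction count fixed as the number of registers grows, which matters when the number of instructions in a program is limited.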
FIG. 3A illustrates an exemplary conventional texture mapping process wherein complex three dimensional (3-D) objects, or portions thereof, can be represented by collections of adjacent triangles (“a mesh”) representing the approximate geometry of the 3-D object, or by a geometry map, or surface, in two dimensional (2-D) surface space. One or more texture maps can be mapped to the surface to create a textured surface according to a texture mapping process. In a conventional graphics system, the surface geometry sampling happens before texture sampling. In this regard, signals textured over a surface can be very general, and can specify any sort of intermediate result that can be input to a shader procedure to produce a final color associated with a point sample, and thus need not specify a function of color or grey scale values.
After texture sampling, additional transformations optionally can be applied to the textured surface prior to rendering the image with picture elements (pixels) of a display device. Images in computer graphics are represented as a 2-D array of discrete values (grey scale) or as three 2-D arrays of discrete values (color). Using a standard (x, y, z) rectangular coordinate system, a surface can be specified as a mesh (e.g., triangle mesh) with an (x, y, z) coordinate per mesh vertex, or as a geometry map in which the (x, y, z) coordinates are specified as a rectilinear image over a 2D (u, v) coordinate system, sometimes called the surface parameterization domain. Texture map(s) can also be specified with the (u, v) coordinate system.
Point samples in the surface parametrization domain, where signals have been attached to the surface, including its geometry, can be generated from textured meshes or geometry maps. These samples can be transformed and shaded using a variety of computations. At the end of this transformation and shading processing, a point sample includes (a) positional information, i.e., an image address indicating where in the image plane the point maps to and (b) textured color, or grey scale, information that indicates the color of the sample at the position indicated by the positional information. Other data, such as depth information of the point sample to allow hidden surface elimination, can also be included. The transformed, textured surface is placed in a frame buffer prior to being rendered by a display in 2-D pixel image space (x, y). At this point, in the case of a black and white display device, each (x, y) pixel location in 2-D image space is assigned a grey value in accordance with some function of the surface in the frame buffer. In the case of a typical color display device, each (x, y) pixel location in 2-D image space is assigned red, green and blue (RGB) values. It is noted that a variety of color formats other than RGB exist as well.
In order to render the surface on the display device itself, conventionally, the textured surface is sampled at positions that reflect the centers of the pixels of the device on which the image is to be displayed. This sampling may be performed by evaluating a function of the transformed, textured surface, at points that correspond to the center of each pixel, by mapping the centers of the pixels back into texture space to determine the point sample that corresponds to the pixel center.
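The sampling at pixel centers described above can be sketched in Python as follows. The sketch assumes, for simplicity, that the image maps directly onto the (u, v) domain of a single texture with no intervening transformation, and that nearest-texel lookup is used; the function names are hypothetical.

```python
def pixel_center_to_uv(x, y, width, height):
    """Map the center of pixel (x, y) to (u, v) in the unit square."""
    return (x + 0.5) / width, (y + 0.5) / height


def sample_nearest(texture, u, v):
    """Nearest-texel lookup in a row-major 2-D list of values."""
    rows, cols = len(texture), len(texture[0])
    col = min(int(u * cols), cols - 1)
    row = min(int(v * rows), rows - 1)
    return texture[row][col]


def render(texture, width, height):
    """Sample the texture once per pixel center of the output image."""
    return [
        [sample_nearest(texture, *pixel_center_to_uv(x, y, width, height))
         for x in range(width)]
        for y in range(height)
    ]


texture = [[0, 255],
           [128, 64]]
image = render(texture, width=4, height=2)
```

Here a 2 by 2 grey scale texture is sampled into a 4 by 2 image, so each texel is picked up by two horizontally adjacent pixel centers.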
Having described an exemplary texture mapping process, FIG. 3B illustrates that present graphics coprocessing subsystem architectures do not accommodate the storage and manipulation of texture maps in video memory by a vertex shader VS. Presently, vertex buffer VB, the video memory allocated for use with the vertex shader VS, can store whole sets of integers, which is suitable for processing of vertex, or positional, information. Vertex buffer VB is thus well suited for storing positional information associated with vertices of a geometry map, but vertex buffer VB cannot presently store, or output, float data precise enough to represent a texture map meeting the requirements for today's graphics pipelines and output devices. Since vertex shader VS generally operates with respect to vertex data, i.e., positional information, sufficient precision to handle the colorization requirements of a texture map has generally not been a concern at the vertex shading stage. Thus, vertex buffer VB does not presently support float data inputs or outputs. However, there are a variety of operations and transformations that can be applied at the vertex shading stage for which float precision would be desirable. More particularly, 32 bit float precision would be desirable for supporting texture storage and processing by vertex shader VS, in keeping with the evolution of the graphics pipeline, including the appearance of high precision monitors that support 10 bit, as opposed to conventional 8 bit, rasterization processes.
It would be further desirable to increase the number of registers available on a vertex shader for use by a vertex shader during operation as input, output, intermediate and other special purpose storage. For instance, a program downloaded to the vertex shader could benefit from an increased amount of register storage available on the vertex shader for more variables, temporary storage, outputs, etc. Presently, the number of register storage elements in a vertex shader VS is limited to 12.
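The precision concern can be made concrete with a small Python sketch that compares the error introduced by storing a normalized value at 8 bits, at 10 bits, and as a 32 bit float. The sample value 0.3 and the helper names are assumptions for illustration only, and the sketch says nothing about how any particular vertex buffer actually stores data.

```python
import struct


def quantize(value, bits):
    """Round a value in [0, 1] to the nearest of 2**bits evenly spaced levels."""
    levels = (1 << bits) - 1
    return round(value * levels) / levels


def to_float32(value):
    """Round-trip a Python float through 32 bit float storage."""
    return struct.unpack("f", struct.pack("f", value))[0]


value = 0.3
err8 = abs(quantize(value, 8) - value)     # 8 bit fixed-point storage
err10 = abs(quantize(value, 10) - value)   # 10 bit fixed-point storage
err32 = abs(to_float32(value) - value)     # 32 bit float storage
```

For this sample value the 32 bit float error is smaller than the 10 bit error, which in turn is smaller than the 8 bit error, so data held at 8 bits cannot fully exercise a 10 bit output stage.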
It would be further desirable to increase the number of instructions that can be accommodated in a program to be downloaded to a vertex shader. Presently, the number of instructions that can be downloaded as a program to a vertex shader via the 3D graphics API is 96. One of ordinary skill can appreciate that the complexity of algorithms to be performed by the vertex shader VS is limited by this limit of instructions. Thus, it would be desirable to raise the bar from the current maximum number of instructions that can be packaged for execution by a vertex shader VS.
With respect to the processing of multiple vertex streams simultaneously, prior art vertex shaders are invoked once per vertex, i.e., with every invocation of the vertex shader, the input registers are initialized with unique vertex elements from the incoming vertex streams. Thus, as illustrated in FIG. 4, with older shader models, a vertex data point is input from each of vertex data streams VDS1 and VDS2 to load the input registers I1 to Ik for each “cycle” of the vertex shader VS. While the processing of multiple vertex data streams, or containers, simultaneously in parallel is advantageous in its own right, not all algorithms are well suited to processing parallel data streams by processing a data point from each data stream upon each operational cycle of the vertex shader VS. For instance, at the cycle designated by start time t1, vertex data points V1 and W1 are input to the appropriate input registers of vertex shader VS for processing and corresponding output. At the start of the next cycle, at time t2, vertex data points V2 and W2 are input to the appropriate input registers of vertex shader VS for processing and corresponding output, and so on. However, currently, a program downloaded to vertex shader VS cannot process two vertex data points from vertex data stream VDS1, then process one vertex data point from vertex data stream VDS2, then two from VDS1, then one from VDS2, and so on, repetitively. Thus, it would be desirable to provide support for division of inputs from multiply specified data streams for processing by the vertex shader VS. More particularly, when multiple data streams are input to a vertex shader VS, it would be desirable to specify frequencies for each data stream input which determine how often data from each respective stream is input to the vertex shader VS.
On the pixel shader side of the graphics pipeline, there are also several ways in which improvement may be achieved. Similar to vertex shaders, for instance, a program can be downloaded to current pixel shaders via a graphics API for execution by the pixel shader generally for specialized operations on pixels. In this regard, the number of local registers provided for use with the pixel shader and the maximum number of instructions that may be provided for a program downloaded to the pixel shader currently limit the complexity and sophistication of operation that can be achieved with a downloaded program. The number of local registers currently available for use in connection with operation of a pixel shader is 6-12 and the maximum number of instructions that a program may include if it is to be downloaded to a pixel shader is 256. Thus, it would be desirable to increase the number of local registers provided on a pixel shader. It would also be desirable to increase the maximum number of instructions that may be associated with a program to be downloaded to a pixel shader.
FIG. 5A illustrates an exemplary conventional configuration of a graphics API with respect to a vertex shader VS, a setup engine SE and a pixel shader PS. Setup engine SE conventionally is used to, as the name implies, setup data for processing by the pixel shader PS in some fashion. For instance, the data from vertex shader VS may be clipped, or formatted for pixel processing, or the span may be setup. Currently, however, there is no way to specify to the pixel engine of pixel shader PS that an incoming pixel data point is a frontward facing pixel or a backward facing pixel, e.g., to achieve different effects for the front face of a triangle as opposed to the back face of a triangle. As a result, as conceptually illustrated in FIG. 5B, pixels p1, p2, p3 appear exactly the same for the front of triangle T as they do for the back of triangle T. Thus, it would be desirable to include the ability to specify whether a pixel is frontward facing or backward facing for use in connection with a pixel shader PS. It would be further desirable to provide a register on the pixel shader PS for storage of such “face” information during pixel processing.
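One common way to decide which way a triangle faces is the sign of its area in 2-D image space, and a per-pixel “face” flag could then select between two effects. The Python sketch below illustrates that idea; the counter-clockwise-is-front convention, the colors and the function names are assumptions for illustration and are not taken from the figures.

```python
def signed_area(p0, p1, p2):
    """Twice the signed area of a 2-D triangle; the sign gives the winding."""
    return ((p1[0] - p0[0]) * (p2[1] - p0[1])
            - (p2[0] - p0[0]) * (p1[1] - p0[1]))


def is_front_facing(p0, p1, p2):
    """Assume counter-clockwise winding (y up) marks the front face."""
    return signed_area(p0, p1, p2) > 0


def shade(is_front):
    """Pick a color from the face flag, as a 'face' register might allow."""
    return (255, 0, 0) if is_front else (0, 0, 255)


triangle = [(0, 0), (1, 0), (0, 1)]
front_color = shade(is_front_facing(*triangle))
back_color = shade(is_front_facing(*reversed(triangle)))
```

Reversing the vertex order flips the sign of the area, so the same triangle T yields one color when seen from the front and another when seen from the back.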
Additionally, the pixel shader side of the graphics pipeline is even more limited than the vertex shader side with respect to branching, i.e., flow control, in programs. While programs, i.e., algorithmic element(s) packaged as tokenized set(s) of instructions, currently can be downloaded to a pixel shader, the flow represented by a program performed by the pixel shader PS must be static, non-branched and not predicated upon characteristics that become known at runtime. Thus, for instance, as illustrated in FIG. 6A, a developer D (or a software application A) can specify a program P having exemplary instructions I1 to I5 to a graphics API GAPI for download to the graphics chip, such as pixel shader PS, in order to program the graphics chip to perform the algorithms represented by the program P. Once the program P is downloaded to pixel shader PS, i.e., once pixel shader PS is programmed with program P, graphics data processed by pixel shader PS must be processed according to the algorithms I1′ to I5′ represented by or corresponding to instructions I1 to I5. However, presently, there can be no branching taking place, whether based upon a characteristic or variable set or generated during operation of the runtime system or not.
While the ability to define a static non-branched process flow for all graphics data to be processed according to algorithms I1′ to I5′ on the pixel shader is beneficial, branching behavior is an important and powerful programming tool and thus it would be desirable to provide both static branching behavior based upon pre-set characteristics of the system, as well as dynamic branching behavior based upon runtime characteristics, for programs downloaded to pixel shaders.
As illustrated in FIG. 6B, represented by the arrows illustrating the computational flow process performed on the graphics data, the processing that occurs for each data point of the graphics data streamed through the pixel shader's execution engine is limited to being processed according to algorithm (or function) I1′, followed by being processed according to algorithm I2′, followed by being processed according to algorithm I3′, followed by being processed according to algorithm (or function) I4′, and lastly by being processed according to algorithm I5′ before being output. In this regard, all of the graphics data must be processed in this exact static sequence, i.e., some of the data cannot be processed according to a different sequence or branch. More particularly, the graphics data cannot currently be processed according to different branches of program P, e.g., an “If Then” or “If Then Else” command or structure cannot be deployed, and no different branches can be statically defined prior to downloading the program either.
Accordingly, it would be desirable to provide both static and dynamic flow control for programs that are downloaded to a pixel shader, whereby a coprocessor can receive a program which thereby programs the coprocessor to process data according to branches and conditions defined by the program, and wherein the coprocessor can process data differently according to the different branches defined by the program. For instance, according to criteria specified in the program, it would be desirable to process some of the data streaming through the coprocessor according to a first algorithm dependent upon the presence of a pre-set constant, or variable set or generated at runtime, and some of the data according to a second algorithm without recourse to downloading another program. In short, it would be desirable to enable branching to occur during the execution of a program once downloaded to the pixel shader to predicate control of the processing of graphics data on preset or runtime characteristics or variables.